magiccodingman committed
Commit 94a426d · verified · 1 Parent(s): f822bc4

File name changes

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. .gitattributes +9 -0
  2. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  3. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  4. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
  5. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
  6. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
  7. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json +44 -0
  8. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md +11 -0
  9. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log +152 -0
  10. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log +152 -0
  11. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log +152 -0
  12. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json +44 -0
  13. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md +11 -0
  14. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log +152 -0
  15. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log +152 -0
  16. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log +152 -0
  17. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json +44 -0
  18. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md +11 -0
  19. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log +153 -0
  20. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log +153 -0
  21. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log +153 -0
  22. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  23. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  24. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  25. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  26. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  27. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  28. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  29. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  30. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  31. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  32. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  33. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  34. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +152 -0
  35. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +152 -0
  36. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +152 -0
  37. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  38. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  39. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +152 -0
  40. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +152 -0
  41. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +152 -0
  42. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json +44 -0
  43. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md +11 -0
  44. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log +151 -0
  45. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log +151 -0
  46. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log +151 -0
  47. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  48. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  49. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +151 -0
  50. Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +151 -0
.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ thinking_budget.png filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-MXFP4_MOE.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-E-B16-D-IQ4NL-KOU-Q6K-HQ-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-IQ4NL.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-EHQKOUD-Q6K.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-HK-B16-EO-Q5K-QUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-HQKU-IQ4NL-EOD-MXFP4.gguf filter=lfs diff=lfs merge=lfs -text
+ Seed-OSS-36B-Instruct-mxfp4_moe-O-MXFP4-EHQKUD-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
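These new entries route the added large artifacts through Git LFS. For reference, equivalent lines can be appended with `git lfs track`; a minimal sketch in Python, assuming git-lfs is installed and the repository root is the working directory (the two filenames are taken from the diff above):

```python
import subprocess

# `git lfs track <path>` appends a matching "filter=lfs diff=lfs merge=lfs -text"
# rule to .gitattributes, which is what this commit adds for each large artifact.
for name in ("tokenizer.json", "Seed-OSS-36B-Instruct-MXFP4_MOE.gguf"):
    subprocess.run(["git", "lfs", "track", name], check=True)
```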
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "raw_metrics": {
+     "llamabench": {
+       "backend": "CUDA",
+       "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
+       "ngl": "35",
+       "raw_row": {
+         "backend": "CUDA",
+         "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+         "ngl": "35",
+         "params": "36.15 B",
+         "size": "19.43 GiB",
+         "t/s": "30.53 \u00b1 0.74",
+         "test": "pp8",
+         "tps_value": 30.53
+       },
+       "test": "pp8",
+       "tps": 30.53
+     },
+     "perplexity": {
+       "code": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
+         "ppl": 1.4176,
+         "ppl_error": 0.00953
+       },
+       "general": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
+         "ppl": 6.8507,
+         "ppl_error": 0.16499
+       },
+       "math": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
+         "ppl": 5.4384,
+         "ppl_error": 0.1198
+       }
+     }
+   },
+   "summary": {
+     "avg_prec_loss_pct": 0.3254,
+     "bench_tps": 30.53,
+     "file_size_bytes": 20864981792,
+     "file_size_gb": 19.43
+   }
+ }
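The `summary` block condenses the raw llama-bench and perplexity numbers for this variant. Below is a minimal sketch of how such a file could be consumed and the precision-loss figure recomputed against the BF16-everything variant also listed in this commit; the averaging formula and paths are assumptions, since the generating script is not part of this diff:

```python
import json
from pathlib import Path

DATA = Path("Benchmarks/DataCollection")
QUANT = "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K"
BASELINE = "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16"

def load_metrics(run_name: str) -> dict:
    """Read the bench_metrics.json collected for one quantization variant."""
    return json.loads((DATA / run_name / "bench_metrics.json").read_text())

def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    """Assumed formula: mean relative perplexity increase vs. the baseline, in percent."""
    losses = []
    for cat in ("code", "general", "math"):
        q = quant["raw_metrics"]["perplexity"][cat]["ppl"]
        b = baseline["raw_metrics"]["perplexity"][cat]["ppl"]
        losses.append((q - b) / b * 100.0)
    return sum(losses) / len(losses)

if __name__ == "__main__":
    print(round(avg_prec_loss_pct(load_metrics(QUANT), load_metrics(BASELINE)), 4))
```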
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 30.53 ± 0.74 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 5.08 ± 0.02 |
+
+ build: 92bb442ad (7040)
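A table like the one above comes from llama.cpp's `llama-bench` tool. A sketch of an equivalent invocation, wrapped in Python to keep the examples in one language; the model path is hypothetical, and `-p 8` / `-n 128` are assumed to correspond to the `pp8` and `tg128` rows shown:

```python
import subprocess

cmd = [
    "llama-bench",
    "-m", "dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-lm_head_Q5_K.gguf",  # hypothetical local path
    "-ngl", "35",   # layers offloaded to GPU, matching the ngl column
    "-p", "8",      # prompt-processing test (pp8)
    "-n", "128",    # token-generation test (tg128)
    "-o", "md",     # emit a markdown table like the one above
]
subprocess.run(cmd, check=True)
```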
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 113.066 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.64 seconds per pass - ETA 5.30 minutes
+ [1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
+ Final estimate: PPL = 1.4176 +/- 0.00953
+
+ llama_perf_context_print: load time = 2579.70 ms
+ llama_perf_context_print: prompt eval time = 306842.14 ms / 98304 tokens ( 3.12 ms per token, 320.37 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 308603.02 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16342 + ( 3874 = 2960 + 80 + 833) + 3898 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
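Each perplexity_*.log ends with a `Final estimate: PPL = X +/- Y` line, which is what bench_metrics.json records as `ppl` / `ppl_error`. A small parsing sketch (the regex is an assumption based on the log format shown above):

```python
import re
from pathlib import Path

FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: Path) -> tuple[float, float]:
    """Return (ppl, ppl_error) from a llama.cpp perplexity log."""
    m = FINAL_RE.search(log_path.read_text())
    if m is None:
        raise ValueError(f"no final PPL estimate in {log_path}")
    return float(m.group(1)), float(m.group(2))

# e.g. read_final_ppl(Path("perplexity_code.log")) -> (1.4176, 0.00953)
```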
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20468 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 50.772 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.57 seconds per pass - ETA 1.63 minutes
+ [1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
+ Final estimate: PPL = 6.8507 +/- 0.16499
+
+ llama_perf_context_print: load time = 2555.14 ms
+ llama_perf_context_print: prompt eval time = 95523.06 ms / 30720 tokens ( 3.11 ms per token, 321.60 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 96021.46 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16389 + ( 3874 = 2960 + 80 + 833) + 3851 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q5_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.43 GiB (4.62 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
+ load_tensors: CUDA0 model buffer size = 2960.23 MiB
+ load_tensors: CUDA1 model buffer size = 2960.23 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.827 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.63 seconds per pass - ETA 1.77 minutes
+ [1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
+ Final estimate: PPL = 5.4384 +/- 0.11980
+
+ llama_perf_context_print: load time = 2540.69 ms
+ llama_perf_context_print: prompt eval time = 102170.30 ms / 32768 tokens ( 3.12 ms per token, 320.72 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 102686.99 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16345 + ( 3874 = 2960 + 80 + 833) + 3896 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20148 + ( 3234 = 2960 + 80 + 194) + 741 |
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "raw_metrics": {
+     "llamabench": {
+       "backend": "CUDA",
+       "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md",
+       "ngl": "35",
+       "raw_row": {
+         "backend": "CUDA",
+         "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+         "ngl": "35",
+         "params": "36.15 B",
+         "size": "19.94 GiB",
+         "t/s": "25.00 \u00b1 2.73",
+         "test": "pp8",
+         "tps_value": 25.0
+       },
+       "test": "pp8",
+       "tps": 25.0
+     },
+     "perplexity": {
+       "code": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log",
+         "ppl": 1.4162,
+         "ppl_error": 0.00952
+       },
+       "general": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log",
+         "ppl": 6.8281,
+         "ppl_error": 0.16452
+       },
+       "math": {
+         "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log",
+         "ppl": 5.442,
+         "ppl_error": 0.11987
+       }
+     }
+   },
+   "summary": {
+     "avg_prec_loss_pct": 0.3797,
+     "bench_tps": 25.0,
+     "file_size_bytes": 21416119072,
+     "file_size_gb": 19.95
+   }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | pp8 | 25.00 ± 2.73 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.94 GiB | 36.15 B | CUDA | 35 | tg128 | 4.99 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20469 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 25
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type iq4_nl: 385 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 19.94 GiB (4.74 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
+ .................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 111.885 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 6.75 seconds per pass - ETA 5.38 minutes
+ [1]1.5604,[2]1.4661,[3]1.2906,[4]1.2351,[5]1.1912,[6]1.2788,[7]1.3853,[8]1.4444,[9]1.4265,[10]1.4033,[11]1.3807,[12]1.3857,[13]1.3862,[14]1.3715,[15]1.3527,[16]1.3679,[17]1.3692,[18]1.3505,[19]1.3481,[20]1.3641,[21]1.3544,[22]1.3441,[23]1.3545,[24]1.3490,[25]1.3521,[26]1.3479,[27]1.3652,[28]1.3705,[29]1.3711,[30]1.3719,[31]1.3692,[32]1.3801,[33]1.3807,[34]1.3732,[35]1.3689,[36]1.3638,[37]1.3718,[38]1.3806,[39]1.3719,[40]1.3937,[41]1.4027,[42]1.4057,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4151,[48]1.4162,
+ Final estimate: PPL = 1.4162 +/- 0.00952
+
+ llama_perf_context_print: load time = 2551.49 ms
+ llama_perf_context_print: prompt eval time = 312027.57 ms / 98304 tokens ( 3.17 ms per token, 315.05 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 313585.58 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16234 + ( 4041 = 3026 + 80 + 934) + 3839 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20465 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.94 GiB (4.74 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
106
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
107
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.269 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.73 seconds per pass - ETA 1.67 minutes
141
+ [1]6.9699,[2]8.0356,[3]8.4251,[4]8.1681,[5]7.9547,[6]6.6787,[7]5.8897,[8]5.9520,[9]6.2160,[10]6.2771,[11]6.4095,[12]6.7141,[13]6.7415,[14]6.8193,[15]6.8281,
142
+ Final estimate: PPL = 6.8281 +/- 0.16452
143
+
144
+ llama_perf_context_print: load time = 2816.56 ms
145
+ llama_perf_context_print: prompt eval time = 97428.48 ms / 30720 tokens ( 3.17 ms per token, 315.31 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 97912.35 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16225 + ( 4041 = 3026 + 80 + 934) + 3848 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
152
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20474 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.94 GiB (4.74 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 14364.72 MiB
106
+ load_tensors: CUDA0 model buffer size = 3026.64 MiB
107
+ load_tensors: CUDA1 model buffer size = 3026.64 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.395 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.77 seconds per pass - ETA 1.80 minutes
141
+ [1]2.7756,[2]2.9075,[3]3.3369,[4]3.5882,[5]4.0989,[6]4.3711,[7]4.5809,[8]4.7081,[9]4.8552,[10]5.0064,[11]5.0821,[12]5.1599,[13]5.2922,[14]5.4012,[15]5.4277,[16]5.4420,
142
+ Final estimate: PPL = 5.4420 +/- 0.11987
143
+
144
+ llama_perf_context_print: load time = 2590.28 ms
145
+ llama_perf_context_print: prompt eval time = 104259.46 ms / 32768 tokens ( 3.18 ms per token, 314.29 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 104865.05 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16229 + ( 4041 = 3026 + 80 + 934) + 3844 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20082 + ( 3300 = 3026 + 80 + 194) + 740 |
152
+ llama_memory_breakdown_print: | - Host | 14730 = 14364 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "20.88 GiB",
13
+ "t/s": "25.81 \u00b1 2.12",
14
+ "test": "pp8",
15
+ "tps_value": 25.81
16
+ },
17
+ "test": "pp8",
18
+ "tps": 25.81
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log",
23
+ "ppl": 1.4161,
24
+ "ppl_error": 0.00951
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log",
28
+ "ppl": 6.822,
29
+ "ppl_error": 0.16425
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log",
33
+ "ppl": 5.4388,
34
+ "ppl_error": 0.11973
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.4265,
40
+ "bench_tps": 25.81,
41
+ "file_size_bytes": 22421134112,
42
+ "file_size_gb": 20.88
43
+ }
44
+ }
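Note: the `ppl` / `ppl_error` values recorded in each bench_metrics.json match the `Final estimate: PPL = X +/- Y` line emitted at the end of the corresponding perplexity log above. A minimal, hypothetical sketch of extracting that pair from such a log (this helper is illustrative only and not part of the committed files):

```python
import re
from pathlib import Path

# Matches the closing line of a llama-perplexity run, e.g.
# "Final estimate: PPL = 1.4161 +/- 0.00951"
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) parsed from a perplexity_*.log file."""
    text = Path(log_path).read_text()
    match = FINAL_RE.search(text)
    if match is None:
        raise ValueError(f"no 'Final estimate' line found in {log_path}")
    return float(match.group(1)), float(match.group(2))

# Example (path assumed): read_final_ppl("perplexity_code.log") -> (1.4161, 0.00951)
```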
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | pp8 | 25.81 ± 2.12 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 20.88 GiB | 36.15 B | CUDA | 35 | tg128 | 4.76 ± 0.00 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20719 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 113.425 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.95 seconds per pass - ETA 5.55 minutes
141
+ [1]1.5621,[2]1.4658,[3]1.2905,[4]1.2351,[5]1.1909,[6]1.2778,[7]1.3842,[8]1.4435,[9]1.4255,[10]1.4025,[11]1.3801,[12]1.3853,[13]1.3858,[14]1.3712,[15]1.3524,[16]1.3676,[17]1.3689,[18]1.3502,[19]1.3478,[20]1.3637,[21]1.3540,[22]1.3436,[23]1.3541,[24]1.3486,[25]1.3516,[26]1.3475,[27]1.3646,[28]1.3700,[29]1.3705,[30]1.3714,[31]1.3688,[32]1.3796,[33]1.3802,[34]1.3727,[35]1.3684,[36]1.3634,[37]1.3714,[38]1.3803,[39]1.3716,[40]1.3934,[41]1.4025,[42]1.4056,[43]1.4139,[44]1.4150,[45]1.4082,[46]1.4111,[47]1.4149,[48]1.4161,
142
+ Final estimate: PPL = 1.4161 +/- 0.00951
143
+
144
+ llama_perf_context_print: load time = 2635.78 ms
145
+ llama_perf_context_print: prompt eval time = 320424.15 ms / 98304 tokens ( 3.26 ms per token, 306.79 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 321933.86 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15877 + ( 4345 = 3147 + 80 + 1117) + 3891 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20718 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 48.519 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.90 seconds per pass - ETA 1.72 minutes
141
+ [1]6.9907,[2]8.0518,[3]8.4369,[4]8.1699,[5]7.9557,[6]6.6757,[7]5.8864,[8]5.9502,[9]6.2128,[10]6.2733,[11]6.4054,[12]6.7076,[13]6.7338,[14]6.8125,[15]6.8220,
142
+ Final estimate: PPL = 6.8220 +/- 0.16425
143
+
144
+ llama_perf_context_print: load time = 2659.91 ms
145
+ llama_perf_context_print: prompt eval time = 100053.26 ms / 30720 tokens ( 3.26 ms per token, 307.04 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100537.54 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16177 + ( 4345 = 3147 + 80 + 1117) + 3592 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 20.88 GiB (4.96 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 15080.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 3147.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 3147.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.701 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.92 seconds per pass - ETA 1.83 minutes
141
+ [1]2.7663,[2]2.9032,[3]3.3313,[4]3.5822,[5]4.0935,[6]4.3647,[7]4.5735,[8]4.7008,[9]4.8480,[10]4.9997,[11]5.0763,[12]5.1539,[13]5.2864,[14]5.3967,[15]5.4243,[16]5.4388,
142
+ Final estimate: PPL = 5.4388 +/- 0.11973
143
+
144
+ llama_perf_context_print: load time = 2648.84 ms
145
+ llama_perf_context_print: prompt eval time = 106970.58 ms / 32768 tokens ( 3.26 ms per token, 306.33 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 107483.72 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15873 + ( 4345 = 3147 + 80 + 1117) + 3896 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19962 + ( 3421 = 3147 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 15447 = 15080 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "27.86 GiB",
13
+ "t/s": "19.03 \u00b1 0.59",
14
+ "test": "pp8",
15
+ "tps_value": 19.03
16
+ },
17
+ "test": "pp8",
18
+ "tps": 19.03
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log",
23
+ "ppl": 1.4133,
24
+ "ppl_error": 0.00946
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log",
28
+ "ppl": 6.8037,
29
+ "ppl_error": 0.16387
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log",
33
+ "ppl": 5.3769,
34
+ "ppl_error": 0.11787
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.828,
40
+ "bench_tps": 19.03,
41
+ "file_size_bytes": 29924678432,
42
+ "file_size_gb": 27.87
43
+ }
44
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | pp8 | 19.03 ± 0.59 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 27.86 GiB | 36.15 B | CUDA | 35 | tg128 | 3.36 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19183 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 121.381 ms
140
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.93 seconds per pass - ETA 7.13 minutes
142
+ [1]1.5570,[2]1.4577,[3]1.2857,[4]1.2317,[5]1.1880,[6]1.2751,[7]1.3796,[8]1.4376,[9]1.4205,[10]1.3983,[11]1.3764,[12]1.3819,[13]1.3820,[14]1.3678,[15]1.3488,[16]1.3645,[17]1.3655,[18]1.3472,[19]1.3450,[20]1.3606,[21]1.3508,[22]1.3407,[23]1.3511,[24]1.3456,[25]1.3494,[26]1.3454,[27]1.3623,[28]1.3675,[29]1.3677,[30]1.3685,[31]1.3657,[32]1.3763,[33]1.3769,[34]1.3693,[35]1.3651,[36]1.3603,[37]1.3681,[38]1.3769,[39]1.3684,[40]1.3900,[41]1.3990,[42]1.4020,[43]1.4105,[44]1.4116,[45]1.4052,[46]1.4081,[47]1.4121,[48]1.4133,
143
+ Final estimate: PPL = 1.4133 +/- 0.00946
144
+
145
+ llama_perf_context_print: load time = 3785.94 ms
146
+ llama_perf_context_print: prompt eval time = 416431.87 ms / 98304 tokens ( 4.24 ms per token, 236.06 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 418154.88 ms / 98305 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14172 + ( 4784 = 3879 + 72 + 833) + 5157 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19144 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 49.451 ms
140
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.88 seconds per pass - ETA 2.22 minutes
142
+ [1]6.9263,[2]7.9946,[3]8.3471,[4]8.0938,[5]7.8933,[6]6.6467,[7]5.8691,[8]5.9351,[9]6.2024,[10]6.2615,[11]6.3906,[12]6.6929,[13]6.7168,[14]6.7946,[15]6.8037,
143
+ Final estimate: PPL = 6.8037 +/- 0.16387
144
+
145
+ llama_perf_context_print: load time = 3982.53 ms
146
+ llama_perf_context_print: prompt eval time = 130029.31 ms / 30720 tokens ( 4.23 ms per token, 236.25 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 130712.15 ms / 30721 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14201 + ( 4784 = 3879 + 72 + 833) + 5128 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,153 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19152 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_Q8_0-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 128 tensors
46
+ llama_model_loader: - type q5_K: 65 tensors
47
+ llama_model_loader: - type iq4_nl: 257 tensors
48
+ print_info: file format = GGUF V3 (latest)
49
+ print_info: file type = IQ4_NL - 4.5 bpw
50
+ print_info: file size = 27.86 GiB (6.62 BPW)
51
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
52
+ load: printing all EOG tokens:
53
+ load: - 2 ('<seed:eos>')
54
+ load: special tokens cache size = 128
55
+ load: token to piece cache size = 0.9296 MB
56
+ print_info: arch = seed_oss
57
+ print_info: vocab_only = 0
58
+ print_info: n_ctx_train = 524288
59
+ print_info: n_embd = 5120
60
+ print_info: n_embd_inp = 5120
61
+ print_info: n_layer = 64
62
+ print_info: n_head = 80
63
+ print_info: n_head_kv = 8
64
+ print_info: n_rot = 128
65
+ print_info: n_swa = 0
66
+ print_info: is_swa_any = 0
67
+ print_info: n_embd_head_k = 128
68
+ print_info: n_embd_head_v = 128
69
+ print_info: n_gqa = 10
70
+ print_info: n_embd_k_gqa = 1024
71
+ print_info: n_embd_v_gqa = 1024
72
+ print_info: f_norm_eps = 0.0e+00
73
+ print_info: f_norm_rms_eps = 1.0e-06
74
+ print_info: f_clamp_kqv = 0.0e+00
75
+ print_info: f_max_alibi_bias = 0.0e+00
76
+ print_info: f_logit_scale = 0.0e+00
77
+ print_info: f_attn_scale = 0.0e+00
78
+ print_info: n_ff = 27648
79
+ print_info: n_expert = 0
80
+ print_info: n_expert_used = 0
81
+ print_info: n_expert_groups = 0
82
+ print_info: n_group_used = 0
83
+ print_info: causal attn = 1
84
+ print_info: pooling type = 0
85
+ print_info: rope type = 2
86
+ print_info: rope scaling = linear
87
+ print_info: freq_base_train = 10000000.0
88
+ print_info: freq_scale_train = 1
89
+ print_info: n_ctx_orig_yarn = 524288
90
+ print_info: rope_finetuned = unknown
91
+ print_info: model type = 36B
92
+ print_info: model params = 36.15 B
93
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
94
+ print_info: vocab type = BPE
95
+ print_info: n_vocab = 155136
96
+ print_info: n_merges = 154737
97
+ print_info: BOS token = 0 '<seed:bos>'
98
+ print_info: EOS token = 2 '<seed:eos>'
99
+ print_info: PAD token = 1 '<seed:pad>'
100
+ print_info: LF token = 326 'Ċ'
101
+ print_info: EOG token = 2 '<seed:eos>'
102
+ print_info: max token length = 1024
103
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
104
+ load_tensors: offloading 20 repeating layers to GPU
105
+ load_tensors: offloaded 20/65 layers to GPU
106
+ load_tensors: CPU_Mapped model buffer size = 19911.93 MiB
107
+ load_tensors: CUDA0 model buffer size = 3879.21 MiB
108
+ load_tensors: CUDA1 model buffer size = 4741.26 MiB
109
+ ...................................................................................................
110
+ llama_context: constructing llama_context
111
+ llama_context: n_seq_max = 1
112
+ llama_context: n_ctx = 2048
113
+ llama_context: n_ctx_seq = 2048
114
+ llama_context: n_batch = 2048
115
+ llama_context: n_ubatch = 512
116
+ llama_context: causal_attn = 1
117
+ llama_context: flash_attn = auto
118
+ llama_context: kv_unified = false
119
+ llama_context: freq_base = 10000000.0
120
+ llama_context: freq_scale = 1
121
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
122
+ llama_context: CPU output buffer size = 0.59 MiB
123
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
124
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
125
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
126
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
127
+ llama_context: Flash Attention was auto, set to enabled
128
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
129
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
130
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
131
+ llama_context: graph nodes = 2183
132
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
133
+ common_init_from_params: added <seed:eos> logit bias = -inf
134
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
135
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
136
+
137
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
138
+ perplexity: tokenizing the input ..
139
+ perplexity: tokenization took 45.858 ms
140
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
141
+ perplexity: 8.86 seconds per pass - ETA 2.35 minutes
142
+ [1]2.7024,[2]2.8391,[3]3.2683,[4]3.5229,[5]4.0384,[6]4.3092,[7]4.5171,[8]4.6441,[9]4.7886,[10]4.9372,[11]5.0187,[12]5.0935,[13]5.2285,[14]5.3377,[15]5.3687,[16]5.3769,
143
+ Final estimate: PPL = 5.3769 +/- 0.11787
144
+
145
+ llama_perf_context_print: load time = 3518.68 ms
146
+ llama_perf_context_print: prompt eval time = 138566.37 ms / 32768 tokens ( 4.23 ms per token, 236.48 tokens per second)
147
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
148
+ llama_perf_context_print: total time = 139097.25 ms / 32769 tokens
149
+ llama_perf_context_print: graphs reused = 0
150
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
151
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 14169 + ( 4784 = 3879 + 72 + 833) + 5160 |
152
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 18360 + ( 5023 = 4741 + 88 + 194) + 740 |
153
+ llama_memory_breakdown_print: | - Host | 20277 = 19911 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.04 GiB",
+ "t/s": "28.78 \u00b1 3.16",
+ "test": "pp8",
+ "tps_value": 28.78
+ },
+ "test": "pp8",
+ "tps": 28.78
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4161,
+ "ppl_error": 0.00948
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8715,
+ "ppl_error": 0.16547
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4643,
+ "ppl_error": 0.12019
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.2769,
+ "bench_tps": 28.78,
+ "file_size_bytes": 20445551392,
+ "file_size_gb": 19.04
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | pp8 | 28.78 ± 3.16 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.04 GiB | 36.15 B | CUDA | 35 | tg128 | 5.28 ± 0.05 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20350 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 113.916 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.62 seconds per pass - ETA 5.28 minutes
141
+ [1]1.5690,[2]1.4709,[3]1.2935,[4]1.2379,[5]1.1930,[6]1.2804,[7]1.3859,[8]1.4444,[9]1.4266,[10]1.4032,[11]1.3804,[12]1.3856,[13]1.3859,[14]1.3713,[15]1.3522,[16]1.3675,[17]1.3691,[18]1.3504,[19]1.3479,[20]1.3635,[21]1.3537,[22]1.3435,[23]1.3540,[24]1.3484,[25]1.3513,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3712,[31]1.3684,[32]1.3794,[33]1.3801,[34]1.3725,[35]1.3682,[36]1.3631,[37]1.3710,[38]1.3798,[39]1.3712,[40]1.3934,[41]1.4024,[42]1.4055,[43]1.4139,[44]1.4150,[45]1.4083,[46]1.4112,[47]1.4148,[48]1.4161,
142
+ Final estimate: PPL = 1.4161 +/- 0.00948
143
+
144
+ llama_perf_context_print: load time = 2643.34 ms
145
+ llama_perf_context_print: prompt eval time = 306009.13 ms / 98304 tokens ( 3.11 ms per token, 321.25 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 307563.05 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16500 + ( 3716 = 2897 + 80 + 739) + 3898 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20255 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 50.4 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.59 seconds per pass - ETA 1.63 minutes
141
+ [1]7.0029,[2]8.1095,[3]8.5099,[4]8.2651,[5]8.0598,[6]6.7509,[7]5.9414,[8]5.9974,[9]6.2556,[10]6.3166,[11]6.4537,[12]6.7592,[13]6.7836,[14]6.8633,[15]6.8715,
142
+ Final estimate: PPL = 6.8715 +/- 0.16547
143
+
144
+ llama_perf_context_print: load time = 2509.91 ms
145
+ llama_perf_context_print: prompt eval time = 95548.39 ms / 30720 tokens ( 3.11 ms per token, 321.51 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 96051.84 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16436 + ( 3716 = 2897 + 80 + 739) + 3961 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20416 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.04 GiB (4.52 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13696.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.049 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.64 seconds per pass - ETA 1.77 minutes
141
+ [1]2.7786,[2]2.9173,[3]3.3383,[4]3.6012,[5]4.1247,[6]4.3961,[7]4.6081,[8]4.7268,[9]4.8720,[10]5.0228,[11]5.0974,[12]5.1771,[13]5.3097,[14]5.4174,[15]5.4507,[16]5.4643,
142
+ Final estimate: PPL = 5.4643 +/- 0.12019
143
+
144
+ llama_perf_context_print: load time = 2506.68 ms
145
+ llama_perf_context_print: prompt eval time = 102309.14 ms / 32768 tokens ( 3.12 ms per token, 320.28 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 102843.30 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16414 + ( 3716 = 2897 + 80 + 739) + 3984 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14062 = 13696 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.13 GiB",
+ "t/s": "26.86 \u00b1 2.90",
+ "test": "pp8",
+ "tps_value": 26.86
+ },
+ "test": "pp8",
+ "tps": 26.86
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4159,
+ "ppl_error": 0.00948
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8703,
+ "ppl_error": 0.16545
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4647,
+ "ppl_error": 0.12022
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.2805,
+ "bench_tps": 26.86,
+ "file_size_bytes": 20551043872,
+ "file_size_gb": 19.14
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | pp8 | 26.86 ± 2.90 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.13 GiB | 36.15 B | CUDA | 35 | tg128 | 5.33 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20410 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 112.004 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.48 seconds per pass - ETA 5.18 minutes
141
+ [1]1.5664,[2]1.4694,[3]1.2926,[4]1.2373,[5]1.1926,[6]1.2807,[7]1.3863,[8]1.4452,[9]1.4274,[10]1.4039,[11]1.3810,[12]1.3861,[13]1.3865,[14]1.3716,[15]1.3526,[16]1.3677,[17]1.3693,[18]1.3506,[19]1.3481,[20]1.3637,[21]1.3539,[22]1.3437,[23]1.3542,[24]1.3486,[25]1.3515,[26]1.3473,[27]1.3644,[28]1.3698,[29]1.3702,[30]1.3711,[31]1.3684,[32]1.3793,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3797,[39]1.3710,[40]1.3932,[41]1.4022,[42]1.4052,[43]1.4136,[44]1.4147,[45]1.4080,[46]1.4109,[47]1.4146,[48]1.4159,
142
+ Final estimate: PPL = 1.4159 +/- 0.00948
143
+
144
+ llama_perf_context_print: load time = 2495.13 ms
145
+ llama_perf_context_print: prompt eval time = 299791.84 ms / 98304 tokens ( 3.05 ms per token, 327.91 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 301299.08 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16504 + ( 3716 = 2897 + 80 + 739) + 3893 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20404 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.617 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.44 seconds per pass - ETA 1.60 minutes
141
+ [1]7.0036,[2]8.1097,[3]8.5035,[4]8.2616,[5]8.0590,[6]6.7519,[7]5.9419,[8]5.9978,[9]6.2567,[10]6.3171,[11]6.4548,[12]6.7584,[13]6.7832,[14]6.8623,[15]6.8703,
142
+ Final estimate: PPL = 6.8703 +/- 0.16545
143
+
144
+ llama_perf_context_print: load time = 2488.76 ms
145
+ llama_perf_context_print: prompt eval time = 93409.83 ms / 30720 tokens ( 3.04 ms per token, 328.87 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 93894.78 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16496 + ( 3716 = 2897 + 80 + 739) + 3901 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20420 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q6_K: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.13 GiB (4.55 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13797.53 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ .................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 44.823 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.50 seconds per pass - ETA 1.72 minutes
141
+ [1]2.7835,[2]2.9216,[3]3.3409,[4]3.6013,[5]4.1248,[6]4.3958,[7]4.6067,[8]4.7254,[9]4.8714,[10]5.0227,[11]5.0961,[12]5.1761,[13]5.3092,[14]5.4165,[15]5.4500,[16]5.4647,
142
+ Final estimate: PPL = 5.4647 +/- 0.12022
143
+
144
+ llama_perf_context_print: load time = 2511.76 ms
145
+ llama_perf_context_print: prompt eval time = 100187.11 ms / 32768 tokens ( 3.06 ms per token, 327.07 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100696.88 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3896 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14163 = 13797 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "36.15 B",
+ "size": "19.31 GiB",
+ "t/s": "25.73 \u00b1 2.34",
+ "test": "pp8",
+ "tps_value": 25.73
+ },
+ "test": "pp8",
+ "tps": 25.73
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
+ "ppl": 1.4159,
+ "ppl_error": 0.00947
+ },
+ "general": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
+ "ppl": 6.8736,
+ "ppl_error": 0.16558
+ },
+ "math": {
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
+ "ppl": 5.4656,
+ "ppl_error": 0.12023
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.27,
+ "bench_tps": 25.73,
+ "file_size_bytes": 20743412512,
+ "file_size_gb": 19.32
+ }
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | pp8 | 25.73 ± 2.34 |
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.31 GiB | 36.15 B | CUDA | 35 | tg128 | 5.32 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20429 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 111.451 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.45 seconds per pass - ETA 5.15 minutes
141
+ [1]1.5697,[2]1.4711,[3]1.2936,[4]1.2378,[5]1.1929,[6]1.2810,[7]1.3862,[8]1.4451,[9]1.4271,[10]1.4037,[11]1.3808,[12]1.3860,[13]1.3864,[14]1.3715,[15]1.3525,[16]1.3677,[17]1.3692,[18]1.3505,[19]1.3480,[20]1.3636,[21]1.3538,[22]1.3436,[23]1.3541,[24]1.3485,[25]1.3513,[26]1.3472,[27]1.3643,[28]1.3697,[29]1.3701,[30]1.3711,[31]1.3683,[32]1.3792,[33]1.3800,[34]1.3723,[35]1.3680,[36]1.3630,[37]1.3709,[38]1.3796,[39]1.3710,[40]1.3932,[41]1.4021,[42]1.4052,[43]1.4137,[44]1.4148,[45]1.4081,[46]1.4110,[47]1.4146,[48]1.4159,
142
+ Final estimate: PPL = 1.4159 +/- 0.00947
143
+
144
+ llama_perf_context_print: load time = 2493.93 ms
145
+ llama_perf_context_print: prompt eval time = 299600.26 ms / 98304 tokens ( 3.05 ms per token, 328.12 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 301103.88 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16531 + ( 3716 = 2897 + 80 + 739) + 3866 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20412 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.856 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.44 seconds per pass - ETA 1.60 minutes
141
+ [1]7.0159,[2]8.1163,[3]8.5190,[4]8.2693,[5]8.0635,[6]6.7555,[7]5.9452,[8]6.0011,[9]6.2586,[10]6.3193,[11]6.4576,[12]6.7629,[13]6.7864,[14]6.8658,[15]6.8736,
142
+ Final estimate: PPL = 6.8736 +/- 0.16558
143
+
144
+ llama_perf_context_print: load time = 2645.96 ms
145
+ llama_perf_context_print: prompt eval time = 93386.51 ms / 30720 tokens ( 3.04 ms per token, 328.96 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 93871.31 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16501 + ( 3716 = 2897 + 80 + 739) + 3897 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20450 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q8_0: 1 tensors
46
+ llama_model_loader: - type iq4_nl: 449 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.31 GiB (4.59 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13980.99 MiB
106
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
107
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
108
+ ................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 46.739 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.48 seconds per pass - ETA 1.72 minutes
141
+ [1]2.7843,[2]2.9209,[3]3.3419,[4]3.6083,[5]4.1312,[6]4.4011,[7]4.6125,[8]4.7301,[9]4.8745,[10]5.0244,[11]5.0976,[12]5.1776,[13]5.3111,[14]5.4185,[15]5.4517,[16]5.4656,
142
+ Final estimate: PPL = 5.4656 +/- 0.12023
143
+
144
+ llama_perf_context_print: load time = 2485.58 ms
145
+ llama_perf_context_print: prompt eval time = 99922.90 ms / 32768 tokens ( 3.05 ms per token, 327.93 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100433.38 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16535 + ( 3716 = 2897 + 80 + 739) + 3862 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20212 + ( 3171 = 2897 + 80 + 194) + 739 |
152
+ llama_memory_breakdown_print: | - Host | 14347 = 13980 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "19.43 GiB",
13
+ "t/s": "28.76 \u00b1 0.96",
14
+ "test": "pp8",
15
+ "tps_value": 28.76
16
+ },
17
+ "test": "pp8",
18
+ "tps": 28.76
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
23
+ "ppl": 1.4176,
24
+ "ppl_error": 0.00953
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
28
+ "ppl": 6.8507,
29
+ "ppl_error": 0.16499
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
33
+ "ppl": 5.4384,
34
+ "ppl_error": 0.1198
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.3254,
40
+ "bench_tps": 28.76,
41
+ "file_size_bytes": 20864981792,
42
+ "file_size_gb": 19.43
43
+ }
44
+ }
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | pp8 | 28.76 ± 0.96 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 19.43 GiB | 36.15 B | CUDA | 35 | tg128 | 4.87 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19133 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 115.322 ms
139
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.91 seconds per pass - ETA 5.52 minutes
141
+ [1]1.5663,[2]1.4687,[3]1.2922,[4]1.2374,[5]1.1926,[6]1.2795,[7]1.3859,[8]1.4450,[9]1.4272,[10]1.4042,[11]1.3816,[12]1.3867,[13]1.3871,[14]1.3725,[15]1.3537,[16]1.3689,[17]1.3703,[18]1.3515,[19]1.3491,[20]1.3652,[21]1.3554,[22]1.3451,[23]1.3557,[24]1.3503,[25]1.3534,[26]1.3494,[27]1.3664,[28]1.3717,[29]1.3721,[30]1.3730,[31]1.3704,[32]1.3812,[33]1.3819,[34]1.3743,[35]1.3700,[36]1.3651,[37]1.3731,[38]1.3820,[39]1.3733,[40]1.3951,[41]1.4041,[42]1.4071,[43]1.4154,[44]1.4164,[45]1.4097,[46]1.4126,[47]1.4164,[48]1.4176,
142
+ Final estimate: PPL = 1.4176 +/- 0.00953
143
+
144
+ llama_perf_context_print: load time = 2582.59 ms
145
+ llama_perf_context_print: prompt eval time = 321034.86 ms / 98304 tokens ( 3.27 ms per token, 306.21 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 322609.98 ms / 98305 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15274 + ( 3569 = 2664 + 72 + 833) + 5271 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19177 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 49.576 ms
139
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.90 seconds per pass - ETA 1.72 minutes
141
+ [1]6.9757,[2]8.0597,[3]8.4705,[4]8.2112,[5]8.0009,[6]6.7181,[7]5.9191,[8]5.9832,[9]6.2474,[10]6.3077,[11]6.4380,[12]6.7394,[13]6.7657,[14]6.8428,[15]6.8507,
142
+ Final estimate: PPL = 6.8507 +/- 0.16499
143
+
144
+ llama_perf_context_print: load time = 2546.10 ms
145
+ llama_perf_context_print: prompt eval time = 99968.38 ms / 30720 tokens ( 3.25 ms per token, 307.30 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 100472.57 ms / 30721 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15365 + ( 3569 = 2664 + 72 + 833) + 5180 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,152 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19043 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_MXFP4-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type q5_K: 65 tensors
46
+ llama_model_loader: - type iq4_nl: 385 tensors
47
+ print_info: file format = GGUF V3 (latest)
48
+ print_info: file type = IQ4_NL - 4.5 bpw
49
+ print_info: file size = 19.43 GiB (4.62 BPW)
50
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
51
+ load: printing all EOG tokens:
52
+ load: - 2 ('<seed:eos>')
53
+ load: special tokens cache size = 128
54
+ load: token to piece cache size = 0.9296 MB
55
+ print_info: arch = seed_oss
56
+ print_info: vocab_only = 0
57
+ print_info: n_ctx_train = 524288
58
+ print_info: n_embd = 5120
59
+ print_info: n_embd_inp = 5120
60
+ print_info: n_layer = 64
61
+ print_info: n_head = 80
62
+ print_info: n_head_kv = 8
63
+ print_info: n_rot = 128
64
+ print_info: n_swa = 0
65
+ print_info: is_swa_any = 0
66
+ print_info: n_embd_head_k = 128
67
+ print_info: n_embd_head_v = 128
68
+ print_info: n_gqa = 10
69
+ print_info: n_embd_k_gqa = 1024
70
+ print_info: n_embd_v_gqa = 1024
71
+ print_info: f_norm_eps = 0.0e+00
72
+ print_info: f_norm_rms_eps = 1.0e-06
73
+ print_info: f_clamp_kqv = 0.0e+00
74
+ print_info: f_max_alibi_bias = 0.0e+00
75
+ print_info: f_logit_scale = 0.0e+00
76
+ print_info: f_attn_scale = 0.0e+00
77
+ print_info: n_ff = 27648
78
+ print_info: n_expert = 0
79
+ print_info: n_expert_used = 0
80
+ print_info: n_expert_groups = 0
81
+ print_info: n_group_used = 0
82
+ print_info: causal attn = 1
83
+ print_info: pooling type = 0
84
+ print_info: rope type = 2
85
+ print_info: rope scaling = linear
86
+ print_info: freq_base_train = 10000000.0
87
+ print_info: freq_scale_train = 1
88
+ print_info: n_ctx_orig_yarn = 524288
89
+ print_info: rope_finetuned = unknown
90
+ print_info: model type = 36B
91
+ print_info: model params = 36.15 B
92
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
93
+ print_info: vocab type = BPE
94
+ print_info: n_vocab = 155136
95
+ print_info: n_merges = 154737
96
+ print_info: BOS token = 0 '<seed:bos>'
97
+ print_info: EOS token = 2 '<seed:eos>'
98
+ print_info: PAD token = 1 '<seed:pad>'
99
+ print_info: LF token = 326 'Ċ'
100
+ print_info: EOG token = 2 '<seed:eos>'
101
+ print_info: max token length = 1024
102
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
103
+ load_tensors: offloading 20 repeating layers to GPU
104
+ load_tensors: offloaded 20/65 layers to GPU
105
+ load_tensors: CPU_Mapped model buffer size = 13971.93 MiB
106
+ load_tensors: CUDA0 model buffer size = 2664.21 MiB
107
+ load_tensors: CUDA1 model buffer size = 3256.26 MiB
108
+ ..................................................................................................
109
+ llama_context: constructing llama_context
110
+ llama_context: n_seq_max = 1
111
+ llama_context: n_ctx = 2048
112
+ llama_context: n_ctx_seq = 2048
113
+ llama_context: n_batch = 2048
114
+ llama_context: n_ubatch = 512
115
+ llama_context: causal_attn = 1
116
+ llama_context: flash_attn = auto
117
+ llama_context: kv_unified = false
118
+ llama_context: freq_base = 10000000.0
119
+ llama_context: freq_scale = 1
120
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
121
+ llama_context: CPU output buffer size = 0.59 MiB
122
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
123
+ llama_kv_cache: CUDA0 KV buffer size = 72.00 MiB
124
+ llama_kv_cache: CUDA1 KV buffer size = 88.00 MiB
125
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
126
+ llama_context: Flash Attention was auto, set to enabled
127
+ llama_context: CUDA0 compute buffer size = 833.78 MiB
128
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
129
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
130
+ llama_context: graph nodes = 2183
131
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
132
+ common_init_from_params: added <seed:eos> logit bias = -inf
133
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
134
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
135
+
136
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
137
+ perplexity: tokenizing the input ..
138
+ perplexity: tokenization took 47.332 ms
139
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
140
+ perplexity: 6.93 seconds per pass - ETA 1.83 minutes
141
+ [1]2.7608,[2]2.9054,[3]3.3323,[4]3.5844,[5]4.0966,[6]4.3685,[7]4.5745,[8]4.7023,[9]4.8500,[10]5.0005,[11]5.0791,[12]5.1544,[13]5.2872,[14]5.3968,[15]5.4247,[16]5.4384,
142
+ Final estimate: PPL = 5.4384 +/- 0.11980
143
+
144
+ llama_perf_context_print: load time = 2673.19 ms
145
+ llama_perf_context_print: prompt eval time = 107032.67 ms / 32768 tokens ( 3.27 ms per token, 306.15 tokens per second)
146
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
147
+ llama_perf_context_print: total time = 107563.78 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 15331 + ( 3569 = 2664 + 72 + 833) + 5213 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19844 + ( 3538 = 3256 + 88 + 194) + 741 |
152
+ llama_memory_breakdown_print: | - Host | 14337 = 13971 + 352 + 14 |
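Each perplexity_*.log above ends with a `Final estimate: PPL = ... +/- ...` line, and the `ppl` / `ppl_error` fields in the bench_metrics.json files match those values. A minimal parsing sketch (the regex and the helper name are assumptions, not part of the benchmark tooling):

```python
import re
from pathlib import Path

# Matches e.g. "Final estimate: PPL = 5.4384 +/- 0.11980"
FINAL_PPL = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_ppl(log_path: str) -> dict:
    """Extract the final perplexity and its error bar from a llama-perplexity log."""
    text = Path(log_path).read_text()
    m = FINAL_PPL.search(text)
    if m is None:
        raise ValueError(f"no final PPL line found in {log_path}")
    return {"ppl": float(m.group(1)), "ppl_error": float(m.group(2))}

# Hypothetical usage against the log above:
# read_ppl(".../perplexity_math.log")  -> {"ppl": 5.4384, "ppl_error": 0.1198}
```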
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "67.34 GiB",
13
+ "t/s": "11.32 \u00b1 0.12",
14
+ "test": "pp8",
15
+ "tps_value": 11.32
16
+ },
17
+ "test": "pp8",
18
+ "tps": 11.32
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log",
23
+ "ppl": 1.4128,
24
+ "ppl_error": 0.00952
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log",
28
+ "ppl": 6.8872,
29
+ "ppl_error": 0.16794
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log",
33
+ "ppl": 5.4442,
34
+ "ppl_error": 0.12088
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.0,
40
+ "bench_tps": 11.32,
41
+ "file_size_bytes": 72311397152,
42
+ "file_size_gb": 67.35
43
+ }
44
+ }
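This BF16 build reports `avg_prec_loss_pct: 0.0`, which suggests it is the reference point for the other quants' summaries. The 0.3254 reported for the attn_output_MXFP4 / lm_head_Q5_K variant above is reproduced by averaging the absolute per-domain PPL deviation from these BF16 values; a sketch under that assumed definition:

```python
# Assumed definition of avg_prec_loss_pct: mean absolute PPL deviation (in %)
# from the BF16 reference across the code / general / math perplexity runs.
BASELINE = {"code": 1.4128, "general": 6.8872, "math": 5.4442}  # BF16 reference above
QUANT    = {"code": 1.4176, "general": 6.8507, "math": 5.4384}  # attn_output_MXFP4 ... lm_head_Q5_K

def avg_prec_loss_pct(quant: dict, baseline: dict) -> float:
    losses = [abs(quant[k] - baseline[k]) / baseline[k] * 100 for k in baseline]
    return round(sum(losses) / len(losses), 4)

print(avg_prec_loss_pct(QUANT, BASELINE))  # 0.3254, matching that quant's summary
```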
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.32 ± 0.12 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.53 ± 0.02 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19670 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 112.237 ms
138
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.78 seconds per pass - ETA 14.22 minutes
140
+ [1]1.5107,[2]1.4416,[3]1.2762,[4]1.2238,[5]1.1809,[6]1.2685,[7]1.3738,[8]1.4318,[9]1.4155,[10]1.3932,[11]1.3715,[12]1.3774,[13]1.3779,[14]1.3640,[15]1.3454,[16]1.3621,[17]1.3633,[18]1.3450,[19]1.3424,[20]1.3583,[21]1.3485,[22]1.3382,[23]1.3488,[24]1.3431,[25]1.3473,[26]1.3431,[27]1.3609,[28]1.3662,[29]1.3668,[30]1.3675,[31]1.3649,[32]1.3754,[33]1.3757,[34]1.3681,[35]1.3643,[36]1.3595,[37]1.3672,[38]1.3761,[39]1.3676,[40]1.3894,[41]1.3983,[42]1.4012,[43]1.4096,[44]1.4109,[45]1.4046,[46]1.4078,[47]1.4116,[48]1.4128,
141
+ Final estimate: PPL = 1.4128 +/- 0.00952
142
+
143
+ llama_perf_context_print: load time = 7800.56 ms
144
+ llama_perf_context_print: prompt eval time = 840300.57 ms / 98304 tokens ( 8.55 ms per token, 116.99 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 841852.62 ms / 98305 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6983 + (12208 = 10300 + 80 + 1828) + 4923 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
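Each of these perplexity logs ends with a `Final estimate: PPL = ... +/- ...` line, which is the value that surfaces in the `bench_metrics.json` files. A minimal sketch of how that figure could be scraped from a log; the helper name and usage path are illustrative, not part of this repo:

```python
import re
from pathlib import Path

# Illustrative helper (not part of this repo): pull the final perplexity
# estimate and its error out of a llama-perplexity log like the ones above.
FINAL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    text = Path(log_path).read_text(encoding="utf-8", errors="ignore")
    match = FINAL_RE.search(text)
    if match is None:
        raise ValueError(f"no 'Final estimate' line found in {log_path}")
    return float(match.group(1)), float(match.group(2))

# Hypothetical usage:
# read_final_ppl("perplexity_general.log")  # -> (6.8872, 0.16794)
```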
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19658 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 49.672 ms
138
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.80 seconds per pass - ETA 4.45 minutes
140
+ [1]7.1957,[2]8.1195,[3]8.4548,[4]8.2130,[5]8.0074,[6]6.7286,[7]5.9325,[8]5.9903,[9]6.2600,[10]6.3190,[11]6.4561,[12]6.7865,[13]6.8028,[14]6.8780,[15]6.8872,
141
+ Final estimate: PPL = 6.8872 +/- 0.16794
142
+
143
+ llama_perf_context_print: load time = 8091.77 ms
144
+ llama_perf_context_print: prompt eval time = 263005.39 ms / 30720 tokens ( 8.56 ms per token, 116.80 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 263735.66 ms / 30721 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7242 + (12208 = 10300 + 80 + 1828) + 4663 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 19403 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type bf16: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 67.34 GiB (16.00 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 48353.80 MiB
105
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
106
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 46.673 ms
138
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 17.81 seconds per pass - ETA 4.73 minutes
140
+ [1]2.6577,[2]2.8378,[3]3.2807,[4]3.5315,[5]4.0764,[6]4.3578,[7]4.5789,[8]4.7049,[9]4.8470,[10]5.0057,[11]5.0877,[12]5.1590,[13]5.2956,[14]5.4047,[15]5.4376,[16]5.4442,
141
+ Final estimate: PPL = 5.4442 +/- 0.12088
142
+
143
+ llama_perf_context_print: load time = 8172.12 ms
144
+ llama_perf_context_print: prompt eval time = 280924.96 ms / 32768 tokens ( 8.57 ms per token, 116.64 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 281977.55 ms / 32769 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 6968 + (12208 = 10300 + 80 + 1828) + 4937 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12270 + (10574 = 10300 + 80 + 194) + 1279 |
151
+ llama_memory_breakdown_print: | - Host | 48719 = 48353 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "seed_oss 36B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "36.15 B",
12
+ "size": "18.94 GiB",
13
+ "t/s": "28.49 \u00b1 3.98",
14
+ "test": "pp8",
15
+ "tps_value": 28.49
16
+ },
17
+ "test": "pp8",
18
+ "tps": 28.49
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.4162,
24
+ "ppl_error": 0.00948
25
+ },
26
+ "general": {
27
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.8712,
29
+ "ppl_error": 0.16544
30
+ },
31
+ "math": {
32
+ "log_path": "Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.4627,
34
+ "ppl_error": 0.12011
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.2709,
40
+ "bench_tps": 28.49,
41
+ "file_size_bytes": 20346264352,
42
+ "file_size_gb": 18.95
43
+ }
44
+ }
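The `summary.avg_prec_loss_pct` value above (0.2709) appears consistent with the mean absolute percent change in perplexity across the three tracks relative to the BF16 reference logs earlier in this diff (code 1.4128, general 6.8872, math 5.4442). A sketch of that assumed calculation, labeled as an inference rather than the repo's documented formula:

```python
# Assumed reconstruction of summary.avg_prec_loss_pct: the mean absolute
# percent change in PPL versus the BF16 reference run. This formula is an
# inference from the numbers in this diff, not a documented part of the repo.
BASELINE = {"code": 1.4128, "general": 6.8872, "math": 5.4442}  # BF16 logs above
QUANT    = {"code": 1.4162, "general": 6.8712, "math": 5.4627}  # IQ4_NL values from bench_metrics.json

def avg_prec_loss_pct(baseline, quant):
    losses = [abs(quant[k] - baseline[k]) / baseline[k] * 100.0 for k in baseline]
    return round(sum(losses) / len(losses), 4)

print(avg_prec_loss_pct(BASELINE, QUANT))  # prints 0.2709
```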
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | pp8 | 28.49 ± 3.98 |
9
+ | seed_oss 36B IQ4_NL - 4.5 bpw | 18.94 GiB | 36.15 B | CUDA | 35 | tg128 | 5.12 ± 0.01 |
10
+
11
+ build: 92bb442ad (7040)
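For completeness, a hypothetical sketch of how the llama-bench markdown rows above could be turned into the `raw_row` records stored in `bench_metrics.json`; the parsing helper is illustrative, not the repo's actual tooling:

```python
# Hypothetical helper (not the repo's actual tooling): parse a llama-bench
# markdown table into records resembling the "raw_row" entries in
# bench_metrics.json.
def parse_llamabench_rows(md_text: str) -> list[dict]:
    lines = [l for l in md_text.splitlines() if l.lstrip().startswith("|")]
    if len(lines) < 3:
        return []
    header = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---:| separator row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        row = dict(zip(header, cells))
        # the "t/s" column looks like "28.49 ± 3.98"; also keep the mean as a float
        row["tps_value"] = float(row["t/s"].split("±")[0])
        rows.append(row)
    return rows
```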
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20217 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type iq4_nl: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 18.94 GiB (4.50 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
105
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
106
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 109.985 ms
138
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 6.58 seconds per pass - ETA 5.27 minutes
140
+ [1]1.5682,[2]1.4703,[3]1.2931,[4]1.2377,[5]1.1929,[6]1.2806,[7]1.3861,[8]1.4448,[9]1.4269,[10]1.4035,[11]1.3807,[12]1.3858,[13]1.3863,[14]1.3715,[15]1.3526,[16]1.3677,[17]1.3694,[18]1.3506,[19]1.3482,[20]1.3638,[21]1.3540,[22]1.3438,[23]1.3542,[24]1.3488,[25]1.3517,[26]1.3476,[27]1.3647,[28]1.3700,[29]1.3704,[30]1.3714,[31]1.3686,[32]1.3796,[33]1.3802,[34]1.3726,[35]1.3683,[36]1.3633,[37]1.3711,[38]1.3799,[39]1.3712,[40]1.3935,[41]1.4025,[42]1.4055,[43]1.4140,[44]1.4151,[45]1.4084,[46]1.4113,[47]1.4149,[48]1.4162,
141
+ Final estimate: PPL = 1.4162 +/- 0.00948
142
+
143
+ llama_perf_context_print: load time = 2451.89 ms
144
+ llama_perf_context_print: prompt eval time = 306874.12 ms / 98304 tokens ( 3.12 ms per token, 320.34 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 308445.15 ms / 98305 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16303 + ( 3716 = 2897 + 80 + 739) + 4094 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
151
+ llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |
Benchmarks/DataCollection/Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,151 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20064 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23060 MiB free
9
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world8/AI/ToBench/Seed-OSS-36B-Instruct-unsloth/Magic_Quant/GGUF/dc_round0_Seed-OSS-36B-Instruct-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
14
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
15
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
16
+ llama_model_loader: - kv 5: general.size_label str = 36B
17
+ llama_model_loader: - kv 6: general.license str = apache-2.0
18
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
19
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
20
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
21
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
22
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
23
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
24
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
25
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
26
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
27
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
28
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
29
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
30
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
31
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
32
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
33
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
34
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
35
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
36
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
37
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
38
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
39
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
41
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
42
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
43
+ llama_model_loader: - kv 32: general.file_type u32 = 25
44
+ llama_model_loader: - type f32: 321 tensors
45
+ llama_model_loader: - type iq4_nl: 450 tensors
46
+ print_info: file format = GGUF V3 (latest)
47
+ print_info: file type = IQ4_NL - 4.5 bpw
48
+ print_info: file size = 18.94 GiB (4.50 BPW)
49
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
50
+ load: printing all EOG tokens:
51
+ load: - 2 ('<seed:eos>')
52
+ load: special tokens cache size = 128
53
+ load: token to piece cache size = 0.9296 MB
54
+ print_info: arch = seed_oss
55
+ print_info: vocab_only = 0
56
+ print_info: n_ctx_train = 524288
57
+ print_info: n_embd = 5120
58
+ print_info: n_embd_inp = 5120
59
+ print_info: n_layer = 64
60
+ print_info: n_head = 80
61
+ print_info: n_head_kv = 8
62
+ print_info: n_rot = 128
63
+ print_info: n_swa = 0
64
+ print_info: is_swa_any = 0
65
+ print_info: n_embd_head_k = 128
66
+ print_info: n_embd_head_v = 128
67
+ print_info: n_gqa = 10
68
+ print_info: n_embd_k_gqa = 1024
69
+ print_info: n_embd_v_gqa = 1024
70
+ print_info: f_norm_eps = 0.0e+00
71
+ print_info: f_norm_rms_eps = 1.0e-06
72
+ print_info: f_clamp_kqv = 0.0e+00
73
+ print_info: f_max_alibi_bias = 0.0e+00
74
+ print_info: f_logit_scale = 0.0e+00
75
+ print_info: f_attn_scale = 0.0e+00
76
+ print_info: n_ff = 27648
77
+ print_info: n_expert = 0
78
+ print_info: n_expert_used = 0
79
+ print_info: n_expert_groups = 0
80
+ print_info: n_group_used = 0
81
+ print_info: causal attn = 1
82
+ print_info: pooling type = 0
83
+ print_info: rope type = 2
84
+ print_info: rope scaling = linear
85
+ print_info: freq_base_train = 10000000.0
86
+ print_info: freq_scale_train = 1
87
+ print_info: n_ctx_orig_yarn = 524288
88
+ print_info: rope_finetuned = unknown
89
+ print_info: model type = 36B
90
+ print_info: model params = 36.15 B
91
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
92
+ print_info: vocab type = BPE
93
+ print_info: n_vocab = 155136
94
+ print_info: n_merges = 154737
95
+ print_info: BOS token = 0 '<seed:bos>'
96
+ print_info: EOS token = 2 '<seed:eos>'
97
+ print_info: PAD token = 1 '<seed:pad>'
98
+ print_info: LF token = 326 'Ċ'
99
+ print_info: EOG token = 2 '<seed:eos>'
100
+ print_info: max token length = 1024
101
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
102
+ load_tensors: offloading 20 repeating layers to GPU
103
+ load_tensors: offloaded 20/65 layers to GPU
104
+ load_tensors: CPU_Mapped model buffer size = 13602.24 MiB
105
+ load_tensors: CUDA0 model buffer size = 2897.73 MiB
106
+ load_tensors: CUDA1 model buffer size = 2897.73 MiB
107
+ ..................................................................................................
108
+ llama_context: constructing llama_context
109
+ llama_context: n_seq_max = 1
110
+ llama_context: n_ctx = 2048
111
+ llama_context: n_ctx_seq = 2048
112
+ llama_context: n_batch = 2048
113
+ llama_context: n_ubatch = 512
114
+ llama_context: causal_attn = 1
115
+ llama_context: flash_attn = auto
116
+ llama_context: kv_unified = false
117
+ llama_context: freq_base = 10000000.0
118
+ llama_context: freq_scale = 1
119
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
120
+ llama_context: CPU output buffer size = 0.59 MiB
121
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
122
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
123
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
124
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
125
+ llama_context: Flash Attention was auto, set to enabled
126
+ llama_context: CUDA0 compute buffer size = 739.09 MiB
127
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
128
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
129
+ llama_context: graph nodes = 2183
130
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
131
+ common_init_from_params: added <seed:eos> logit bias = -inf
132
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
133
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 50.71 ms
138
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 6.57 seconds per pass - ETA 1.63 minutes
140
+ [1]7.0271,[2]8.1273,[3]8.5259,[4]8.2753,[5]8.0705,[6]6.7617,[7]5.9479,[8]6.0034,[9]6.2604,[10]6.3227,[11]6.4549,[12]6.7596,[13]6.7839,[14]6.8638,[15]6.8712,
141
+ Final estimate: PPL = 6.8712 +/- 0.16544
142
+
143
+ llama_perf_context_print: load time = 2474.98 ms
144
+ llama_perf_context_print: prompt eval time = 95401.62 ms / 30720 tokens ( 3.11 ms per token, 322.01 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 95899.95 ms / 30721 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16309 + ( 3716 = 2897 + 80 + 739) + 4089 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 19690 + ( 3171 = 2897 + 80 + 194) + 1262 |
151
+ llama_memory_breakdown_print: | - Host | 13968 = 13602 + 352 + 14 |