Commit
·
f5a18c6
1
Parent(s):
ec62fe2
Add GGUF models
Browse filesThis view is limited to 50 files because it contains too many changes.
See raw diff
- .gitattributes +2 -0
- .obsidian/app.json +1 -0
- .obsidian/appearance.json +1 -0
- .obsidian/core-plugins.json +33 -0
- .obsidian/graph.json +22 -0
- .obsidian/workspace.json +203 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q8.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q5_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q8.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q4_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q5_K.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q8.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_q8-embd_mxfp4.gguf +3 -0
- Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE.gguf +3 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_code.txt +168 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_general.txt +168 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_math.txt +168 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_code.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_general.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_math.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_code.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_general.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_math.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_code.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_general.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_math.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/perplexity_code.txt +169 -0
- Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/perplexity_general.txt +169 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
*.gguf filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
.obsidian/app.json
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
{}
|
.obsidian/appearance.json
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
{}
|
.obsidian/core-plugins.json
ADDED
|
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"file-explorer": true,
|
| 3 |
+
"global-search": true,
|
| 4 |
+
"switcher": true,
|
| 5 |
+
"graph": true,
|
| 6 |
+
"backlink": true,
|
| 7 |
+
"canvas": true,
|
| 8 |
+
"outgoing-link": true,
|
| 9 |
+
"tag-pane": true,
|
| 10 |
+
"footnotes": false,
|
| 11 |
+
"properties": true,
|
| 12 |
+
"page-preview": true,
|
| 13 |
+
"daily-notes": true,
|
| 14 |
+
"templates": true,
|
| 15 |
+
"note-composer": true,
|
| 16 |
+
"command-palette": true,
|
| 17 |
+
"slash-command": false,
|
| 18 |
+
"editor-status": true,
|
| 19 |
+
"bookmarks": true,
|
| 20 |
+
"markdown-importer": false,
|
| 21 |
+
"zk-prefixer": false,
|
| 22 |
+
"random-note": false,
|
| 23 |
+
"outline": true,
|
| 24 |
+
"word-count": true,
|
| 25 |
+
"slides": false,
|
| 26 |
+
"audio-recorder": false,
|
| 27 |
+
"workspaces": false,
|
| 28 |
+
"file-recovery": true,
|
| 29 |
+
"publish": false,
|
| 30 |
+
"sync": true,
|
| 31 |
+
"bases": true,
|
| 32 |
+
"webviewer": false
|
| 33 |
+
}
|
.obsidian/graph.json
ADDED
|
@@ -0,0 +1,22 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"collapse-filter": true,
|
| 3 |
+
"search": "",
|
| 4 |
+
"showTags": false,
|
| 5 |
+
"showAttachments": false,
|
| 6 |
+
"hideUnresolved": false,
|
| 7 |
+
"showOrphans": true,
|
| 8 |
+
"collapse-color-groups": true,
|
| 9 |
+
"colorGroups": [],
|
| 10 |
+
"collapse-display": true,
|
| 11 |
+
"showArrow": false,
|
| 12 |
+
"textFadeMultiplier": 0,
|
| 13 |
+
"nodeSizeMultiplier": 1,
|
| 14 |
+
"lineSizeMultiplier": 1,
|
| 15 |
+
"collapse-forces": true,
|
| 16 |
+
"centerStrength": 0.518713248970312,
|
| 17 |
+
"repelStrength": 10,
|
| 18 |
+
"linkStrength": 1,
|
| 19 |
+
"linkDistance": 250,
|
| 20 |
+
"scale": 1,
|
| 21 |
+
"close": true
|
| 22 |
+
}
|
.obsidian/workspace.json
ADDED
|
@@ -0,0 +1,203 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"main": {
|
| 3 |
+
"id": "e3ac5c2a4b2edb0a",
|
| 4 |
+
"type": "split",
|
| 5 |
+
"children": [
|
| 6 |
+
{
|
| 7 |
+
"id": "f729f51cdc518b04",
|
| 8 |
+
"type": "tabs",
|
| 9 |
+
"children": [
|
| 10 |
+
{
|
| 11 |
+
"id": "2dde3813c2b56127",
|
| 12 |
+
"type": "leaf",
|
| 13 |
+
"state": {
|
| 14 |
+
"type": "markdown",
|
| 15 |
+
"state": {
|
| 16 |
+
"file": "README.md",
|
| 17 |
+
"mode": "source",
|
| 18 |
+
"source": false
|
| 19 |
+
},
|
| 20 |
+
"icon": "lucide-file",
|
| 21 |
+
"title": "README"
|
| 22 |
+
}
|
| 23 |
+
}
|
| 24 |
+
]
|
| 25 |
+
}
|
| 26 |
+
],
|
| 27 |
+
"direction": "vertical"
|
| 28 |
+
},
|
| 29 |
+
"left": {
|
| 30 |
+
"id": "2ad07ef459c4ec0a",
|
| 31 |
+
"type": "split",
|
| 32 |
+
"children": [
|
| 33 |
+
{
|
| 34 |
+
"id": "ab44f29db6fa7569",
|
| 35 |
+
"type": "tabs",
|
| 36 |
+
"children": [
|
| 37 |
+
{
|
| 38 |
+
"id": "4b8e626ccbf646a1",
|
| 39 |
+
"type": "leaf",
|
| 40 |
+
"state": {
|
| 41 |
+
"type": "file-explorer",
|
| 42 |
+
"state": {
|
| 43 |
+
"sortOrder": "alphabetical",
|
| 44 |
+
"autoReveal": false
|
| 45 |
+
},
|
| 46 |
+
"icon": "lucide-folder-closed",
|
| 47 |
+
"title": "Files"
|
| 48 |
+
}
|
| 49 |
+
},
|
| 50 |
+
{
|
| 51 |
+
"id": "5018048bad134c20",
|
| 52 |
+
"type": "leaf",
|
| 53 |
+
"state": {
|
| 54 |
+
"type": "search",
|
| 55 |
+
"state": {
|
| 56 |
+
"query": "",
|
| 57 |
+
"matchingCase": false,
|
| 58 |
+
"explainSearch": false,
|
| 59 |
+
"collapseAll": false,
|
| 60 |
+
"extraContext": false,
|
| 61 |
+
"sortOrder": "alphabetical"
|
| 62 |
+
},
|
| 63 |
+
"icon": "lucide-search",
|
| 64 |
+
"title": "Search"
|
| 65 |
+
}
|
| 66 |
+
},
|
| 67 |
+
{
|
| 68 |
+
"id": "9a87119231678a34",
|
| 69 |
+
"type": "leaf",
|
| 70 |
+
"state": {
|
| 71 |
+
"type": "bookmarks",
|
| 72 |
+
"state": {},
|
| 73 |
+
"icon": "lucide-bookmark",
|
| 74 |
+
"title": "Bookmarks"
|
| 75 |
+
}
|
| 76 |
+
}
|
| 77 |
+
]
|
| 78 |
+
}
|
| 79 |
+
],
|
| 80 |
+
"direction": "horizontal",
|
| 81 |
+
"width": 300
|
| 82 |
+
},
|
| 83 |
+
"right": {
|
| 84 |
+
"id": "cb56c6224010d53d",
|
| 85 |
+
"type": "split",
|
| 86 |
+
"children": [
|
| 87 |
+
{
|
| 88 |
+
"id": "53b11fa01562d3f7",
|
| 89 |
+
"type": "tabs",
|
| 90 |
+
"children": [
|
| 91 |
+
{
|
| 92 |
+
"id": "be8822f49a5db80f",
|
| 93 |
+
"type": "leaf",
|
| 94 |
+
"state": {
|
| 95 |
+
"type": "backlink",
|
| 96 |
+
"state": {
|
| 97 |
+
"file": "Welcome.md",
|
| 98 |
+
"collapseAll": false,
|
| 99 |
+
"extraContext": false,
|
| 100 |
+
"sortOrder": "alphabetical",
|
| 101 |
+
"showSearch": false,
|
| 102 |
+
"searchQuery": "",
|
| 103 |
+
"backlinkCollapsed": false,
|
| 104 |
+
"unlinkedCollapsed": true
|
| 105 |
+
},
|
| 106 |
+
"icon": "links-coming-in",
|
| 107 |
+
"title": "Backlinks for Welcome"
|
| 108 |
+
}
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"id": "5141b3861c995666",
|
| 112 |
+
"type": "leaf",
|
| 113 |
+
"state": {
|
| 114 |
+
"type": "outgoing-link",
|
| 115 |
+
"state": {
|
| 116 |
+
"file": "Welcome.md",
|
| 117 |
+
"linksCollapsed": false,
|
| 118 |
+
"unlinkedCollapsed": true
|
| 119 |
+
},
|
| 120 |
+
"icon": "links-going-out",
|
| 121 |
+
"title": "Outgoing links from Welcome"
|
| 122 |
+
}
|
| 123 |
+
},
|
| 124 |
+
{
|
| 125 |
+
"id": "555d8eee1b9906dd",
|
| 126 |
+
"type": "leaf",
|
| 127 |
+
"state": {
|
| 128 |
+
"type": "tag",
|
| 129 |
+
"state": {
|
| 130 |
+
"sortOrder": "frequency",
|
| 131 |
+
"useHierarchy": true,
|
| 132 |
+
"showSearch": false,
|
| 133 |
+
"searchQuery": ""
|
| 134 |
+
},
|
| 135 |
+
"icon": "lucide-tags",
|
| 136 |
+
"title": "Tags"
|
| 137 |
+
}
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"id": "41beaac18fc8e209",
|
| 141 |
+
"type": "leaf",
|
| 142 |
+
"state": {
|
| 143 |
+
"type": "all-properties",
|
| 144 |
+
"state": {
|
| 145 |
+
"sortOrder": "frequency",
|
| 146 |
+
"showSearch": false,
|
| 147 |
+
"searchQuery": ""
|
| 148 |
+
},
|
| 149 |
+
"icon": "lucide-archive",
|
| 150 |
+
"title": "All properties"
|
| 151 |
+
}
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"id": "72d8cd09d0e7f99e",
|
| 155 |
+
"type": "leaf",
|
| 156 |
+
"state": {
|
| 157 |
+
"type": "outline",
|
| 158 |
+
"state": {
|
| 159 |
+
"file": "Welcome.md",
|
| 160 |
+
"followCursor": false,
|
| 161 |
+
"showSearch": false,
|
| 162 |
+
"searchQuery": ""
|
| 163 |
+
},
|
| 164 |
+
"icon": "lucide-list",
|
| 165 |
+
"title": "Outline of Welcome"
|
| 166 |
+
}
|
| 167 |
+
}
|
| 168 |
+
]
|
| 169 |
+
}
|
| 170 |
+
],
|
| 171 |
+
"direction": "horizontal",
|
| 172 |
+
"width": 300,
|
| 173 |
+
"collapsed": true
|
| 174 |
+
},
|
| 175 |
+
"left-ribbon": {
|
| 176 |
+
"hiddenItems": {
|
| 177 |
+
"switcher:Open quick switcher": false,
|
| 178 |
+
"graph:Open graph view": false,
|
| 179 |
+
"canvas:Create new canvas": false,
|
| 180 |
+
"daily-notes:Open today's daily note": false,
|
| 181 |
+
"templates:Insert template": false,
|
| 182 |
+
"command-palette:Open command palette": false,
|
| 183 |
+
"bases:Create new base": false
|
| 184 |
+
}
|
| 185 |
+
},
|
| 186 |
+
"active": "2dde3813c2b56127",
|
| 187 |
+
"lastOpenFiles": [
|
| 188 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q4_K.gguf",
|
| 189 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q5_K.gguf",
|
| 190 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q8.gguf",
|
| 191 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf",
|
| 192 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q5_K.gguf",
|
| 193 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf",
|
| 194 |
+
"Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf",
|
| 195 |
+
"Benchmarks/Old Test/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q5_K/perplexity.txt",
|
| 196 |
+
"Benchmarks/Old Test/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q5_K/llamabench.txt",
|
| 197 |
+
"Benchmarks/Old Test/Qwen3-4B-Instruct-2507-Q5_K_M/perplexity.txt",
|
| 198 |
+
"Benchmarks/Old Test/Qwen3-4B-Instruct-2507-Q5_K_M/llamabench.txt",
|
| 199 |
+
"Benchmarks/Benchmarks.md",
|
| 200 |
+
"README.md",
|
| 201 |
+
"Welcome.md"
|
| 202 |
+
]
|
| 203 |
+
}
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d3dcf49edd2d410940ce5ccf5fcb72d6c2b9309828d7ea923db31e50112ee68b
|
| 3 |
+
size 4998944320
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:16443b7276ee7102ecab6aae240dc5c79e6306063677a26220a14a720c69e80b
|
| 3 |
+
size 3897181760
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7c40a452a354fbef11c34011df1d2101864f3fca9176e4e2a819563961e7b8de
|
| 3 |
+
size 3992987200
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q8.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ac385d2810aaa556161a02befa80af08d2d7a98fcb62df3400de75784adb4ccb
|
| 3 |
+
size 4280403520
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:137aa0fe67dfb1a75cc626daa02b03832952593d4bf83ca4e674405bcf7ca704
|
| 3 |
+
size 3885385280
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q5_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4250838c84cee6c50356e2252398de34951ba3513746e86830dc91b84e61985d
|
| 3 |
+
size 3934004800
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f7d74f2754ddfb62707bf649d6e9f913267cd1594f77c9746a468a961048f8a5
|
| 3 |
+
size 3985663040
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-embd_q8.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8147bef96d4170bba2240ca8f09baf1361847480f43b7d043d7bfcc9de659d96
|
| 3 |
+
size 4079863360
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q4_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8d531e0b45c5ef847ab2335159219a9d3acad6f37d4ae17643144f0c0ed71f37
|
| 3 |
+
size 3437119040
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q5_K.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:55d4ee81cab27bb2ed72eed2e68c82c2f3d3a900508ad6730d1690952968c2c3
|
| 3 |
+
size 3597805120
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_mxfp4-router_gate_emb_q8.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8147bef96d4170bba2240ca8f09baf1361847480f43b7d043d7bfcc9de659d96
|
| 3 |
+
size 4079863360
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE-output_q8-embd_mxfp4.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c0cdcbe16bdd4887c57f566ed17e9481a5178ae39ff7063886dd77e828e2339b
|
| 3 |
+
size 4073770560
|
Bad-Hybrid-Models/Qwen3-4B-Instruct-2507-MXFP4_MOE.gguf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a1fae3cbc6a5571061e7bd554cd12bf5f3bed22bbe499f3bb1acec312b207fde
|
| 3 |
+
size 2143571520
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3 4B F16 | 7.49 GiB | 4.02 B | CUDA | 35 | pp8 | 263.21 ± 13.57 |
|
| 9 |
+
| qwen3 4B F16 | 7.49 GiB | 4.02 B | CUDA | 35 | tg128 | 34.98 ± 0.28 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_code.txt
ADDED
|
@@ -0,0 +1,168 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21040 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: general.file_type u32 = 1
|
| 32 |
+
llama_model_loader: - kv 21: general.quantization_version u32 = 2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151643
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 151643
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 253 tensors
|
| 45 |
+
print_info: file format = GGUF V3 (latest)
|
| 46 |
+
print_info: file type = F16
|
| 47 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 48 |
+
load: printing all EOG tokens:
|
| 49 |
+
load: - 151643 ('<|endoftext|>')
|
| 50 |
+
load: - 151645 ('<|im_end|>')
|
| 51 |
+
load: - 151662 ('<|fim_pad|>')
|
| 52 |
+
load: - 151663 ('<|repo_name|>')
|
| 53 |
+
load: - 151664 ('<|file_sep|>')
|
| 54 |
+
load: special tokens cache size = 26
|
| 55 |
+
load: token to piece cache size = 0.9311 MB
|
| 56 |
+
print_info: arch = qwen3
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 262144
|
| 59 |
+
print_info: n_embd = 2560
|
| 60 |
+
print_info: n_embd_inp = 2560
|
| 61 |
+
print_info: n_layer = 36
|
| 62 |
+
print_info: n_head = 32
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 4
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 9728
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = -1
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 5000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 4B
|
| 92 |
+
print_info: model params = 4.02 B
|
| 93 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 151936
|
| 96 |
+
print_info: n_merges = 151387
|
| 97 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 98 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 99 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 101 |
+
print_info: LF token = 198 'Ċ'
|
| 102 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 103 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 104 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 105 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 106 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 107 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 108 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 109 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 110 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: max token length = 256
|
| 114 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 115 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 116 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 117 |
+
load_tensors: CPU_Mapped model buffer size = 7672.62 MiB
|
| 118 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 119 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 120 |
+
............................................................................................
|
| 121 |
+
llama_context: constructing llama_context
|
| 122 |
+
llama_context: n_seq_max = 1
|
| 123 |
+
llama_context: n_ctx = 2048
|
| 124 |
+
llama_context: n_ctx_seq = 2048
|
| 125 |
+
llama_context: n_batch = 2048
|
| 126 |
+
llama_context: n_ubatch = 512
|
| 127 |
+
llama_context: causal_attn = 1
|
| 128 |
+
llama_context: flash_attn = auto
|
| 129 |
+
llama_context: kv_unified = false
|
| 130 |
+
llama_context: freq_base = 5000000.0
|
| 131 |
+
llama_context: freq_scale = 1
|
| 132 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 133 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 134 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 135 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 138 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 139 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 140 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 141 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 142 |
+
llama_context: graph nodes = 1267
|
| 143 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 144 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 145 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 149 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 150 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 151 |
+
|
| 152 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 153 |
+
perplexity: tokenizing the input ..
|
| 154 |
+
perplexity: tokenization took 111.633 ms
|
| 155 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 156 |
+
perplexity: 1.80 seconds per pass - ETA 1.32 minutes
|
| 157 |
+
[1]3.1330,[2]2.4635,[3]1.8249,[4]1.6820,[5]1.7988,[6]1.8510,[7]1.8059,[8]1.7774,[9]1.6957,[10]1.6402,[11]1.6073,[12]1.6087,[13]1.5751,[14]1.5527,[15]1.5744,[16]1.5528,[17]1.5399,[18]1.5467,[19]1.5324,[20]1.5127,[21]1.5048,[22]1.5012,[23]1.5227,[24]1.5099,[25]1.5155,[26]1.4982,[27]1.4890,[28]1.4873,[29]1.5028,[30]1.5063,[31]1.4964,[32]1.4857,[33]1.4885,[34]1.4859,[35]1.4852,[36]1.5122,[37]1.5224,[38]1.5281,[39]1.5354,[40]1.5367,[41]1.5302,[42]1.5444,[43]1.5459,[44]1.5468,
|
| 158 |
+
Final estimate: PPL = 1.5468 +/- 0.01221
|
| 159 |
+
|
| 160 |
+
llama_perf_context_print: load time = 1168.83 ms
|
| 161 |
+
llama_perf_context_print: prompt eval time = 68545.08 ms / 90112 tokens ( 0.76 ms per token, 1314.64 tokens per second)
|
| 162 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 163 |
+
llama_perf_context_print: total time = 69789.86 ms / 90113 tokens
|
| 164 |
+
llama_perf_context_print: graphs reused = 0
|
| 165 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 166 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17789 + (3048 = 1925 + 80 + 1043) + 3268 |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20869 + (2079 = 1925 + 80 + 74) + 1175 |
|
| 168 |
+
llama_memory_breakdown_print: | - Host | 7809 = 7672 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_general.txt
ADDED
|
@@ -0,0 +1,168 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21040 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: general.file_type u32 = 1
|
| 32 |
+
llama_model_loader: - kv 21: general.quantization_version u32 = 2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151643
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 151643
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 253 tensors
|
| 45 |
+
print_info: file format = GGUF V3 (latest)
|
| 46 |
+
print_info: file type = F16
|
| 47 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 48 |
+
load: printing all EOG tokens:
|
| 49 |
+
load: - 151643 ('<|endoftext|>')
|
| 50 |
+
load: - 151645 ('<|im_end|>')
|
| 51 |
+
load: - 151662 ('<|fim_pad|>')
|
| 52 |
+
load: - 151663 ('<|repo_name|>')
|
| 53 |
+
load: - 151664 ('<|file_sep|>')
|
| 54 |
+
load: special tokens cache size = 26
|
| 55 |
+
load: token to piece cache size = 0.9311 MB
|
| 56 |
+
print_info: arch = qwen3
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 262144
|
| 59 |
+
print_info: n_embd = 2560
|
| 60 |
+
print_info: n_embd_inp = 2560
|
| 61 |
+
print_info: n_layer = 36
|
| 62 |
+
print_info: n_head = 32
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 4
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 9728
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = -1
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 5000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 4B
|
| 92 |
+
print_info: model params = 4.02 B
|
| 93 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 151936
|
| 96 |
+
print_info: n_merges = 151387
|
| 97 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 98 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 99 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 101 |
+
print_info: LF token = 198 'Ċ'
|
| 102 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 103 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 104 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 105 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 106 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 107 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 108 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 109 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 110 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: max token length = 256
|
| 114 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 115 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 116 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 117 |
+
load_tensors: CPU_Mapped model buffer size = 7672.62 MiB
|
| 118 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 119 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 120 |
+
............................................................................................
|
| 121 |
+
llama_context: constructing llama_context
|
| 122 |
+
llama_context: n_seq_max = 1
|
| 123 |
+
llama_context: n_ctx = 2048
|
| 124 |
+
llama_context: n_ctx_seq = 2048
|
| 125 |
+
llama_context: n_batch = 2048
|
| 126 |
+
llama_context: n_ubatch = 512
|
| 127 |
+
llama_context: causal_attn = 1
|
| 128 |
+
llama_context: flash_attn = auto
|
| 129 |
+
llama_context: kv_unified = false
|
| 130 |
+
llama_context: freq_base = 5000000.0
|
| 131 |
+
llama_context: freq_scale = 1
|
| 132 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 133 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 134 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 135 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 138 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 139 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 140 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 141 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 142 |
+
llama_context: graph nodes = 1267
|
| 143 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 144 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 145 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 149 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 150 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 151 |
+
|
| 152 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 153 |
+
perplexity: tokenizing the input ..
|
| 154 |
+
perplexity: tokenization took 47.408 ms
|
| 155 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 156 |
+
perplexity: 1.76 seconds per pass - ETA 0.43 minutes
|
| 157 |
+
[1]8.2666,[2]10.3709,[3]10.7346,[4]10.3950,[5]10.1135,[6]8.6300,[7]7.7462,[8]7.7280,[9]8.1497,[10]8.2853,[11]8.3091,[12]8.6420,[13]8.6787,[14]8.8131,[15]8.8841,
|
| 158 |
+
Final estimate: PPL = 8.8841 +/- 0.20561
|
| 159 |
+
|
| 160 |
+
llama_perf_context_print: load time = 1175.84 ms
|
| 161 |
+
llama_perf_context_print: prompt eval time = 23615.66 ms / 30720 tokens ( 0.77 ms per token, 1300.83 tokens per second)
|
| 162 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 163 |
+
llama_perf_context_print: total time = 24053.37 ms / 30721 tokens
|
| 164 |
+
llama_perf_context_print: graphs reused = 0
|
| 165 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 166 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17794 + (3048 = 1925 + 80 + 1043) + 3264 |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20869 + (2079 = 1925 + 80 + 74) + 1175 |
|
| 168 |
+
llama_memory_breakdown_print: | - Host | 7809 = 7672 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_math.txt
ADDED
|
@@ -0,0 +1,168 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21035 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: general.file_type u32 = 1
|
| 32 |
+
llama_model_loader: - kv 21: general.quantization_version u32 = 2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = qwen2
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151643
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 151643
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
|
| 42 |
+
llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 253 tensors
|
| 45 |
+
print_info: file format = GGUF V3 (latest)
|
| 46 |
+
print_info: file type = F16
|
| 47 |
+
print_info: file size = 7.49 GiB (16.00 BPW)
|
| 48 |
+
load: printing all EOG tokens:
|
| 49 |
+
load: - 151643 ('<|endoftext|>')
|
| 50 |
+
load: - 151645 ('<|im_end|>')
|
| 51 |
+
load: - 151662 ('<|fim_pad|>')
|
| 52 |
+
load: - 151663 ('<|repo_name|>')
|
| 53 |
+
load: - 151664 ('<|file_sep|>')
|
| 54 |
+
load: special tokens cache size = 26
|
| 55 |
+
load: token to piece cache size = 0.9311 MB
|
| 56 |
+
print_info: arch = qwen3
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 262144
|
| 59 |
+
print_info: n_embd = 2560
|
| 60 |
+
print_info: n_embd_inp = 2560
|
| 61 |
+
print_info: n_layer = 36
|
| 62 |
+
print_info: n_head = 32
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 4
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 9728
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = -1
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 5000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 4B
|
| 92 |
+
print_info: model params = 4.02 B
|
| 93 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 151936
|
| 96 |
+
print_info: n_merges = 151387
|
| 97 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 98 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 99 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 101 |
+
print_info: LF token = 198 'Ċ'
|
| 102 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 103 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 104 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 105 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 106 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 107 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 108 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 109 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 110 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 111 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 112 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 113 |
+
print_info: max token length = 256
|
| 114 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 115 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 116 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 117 |
+
load_tensors: CPU_Mapped model buffer size = 7672.62 MiB
|
| 118 |
+
load_tensors: CUDA0 model buffer size = 1925.21 MiB
|
| 119 |
+
load_tensors: CUDA1 model buffer size = 1925.21 MiB
|
| 120 |
+
............................................................................................
|
| 121 |
+
llama_context: constructing llama_context
|
| 122 |
+
llama_context: n_seq_max = 1
|
| 123 |
+
llama_context: n_ctx = 2048
|
| 124 |
+
llama_context: n_ctx_seq = 2048
|
| 125 |
+
llama_context: n_batch = 2048
|
| 126 |
+
llama_context: n_ubatch = 512
|
| 127 |
+
llama_context: causal_attn = 1
|
| 128 |
+
llama_context: flash_attn = auto
|
| 129 |
+
llama_context: kv_unified = false
|
| 130 |
+
llama_context: freq_base = 5000000.0
|
| 131 |
+
llama_context: freq_scale = 1
|
| 132 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 133 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 134 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 135 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 138 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 139 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 140 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 141 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 142 |
+
llama_context: graph nodes = 1267
|
| 143 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 144 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 145 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 149 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 150 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 151 |
+
|
| 152 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 153 |
+
perplexity: tokenizing the input ..
|
| 154 |
+
perplexity: tokenization took 46.35 ms
|
| 155 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 156 |
+
perplexity: 1.82 seconds per pass - ETA 0.48 minutes
|
| 157 |
+
[1]5.5402,[2]6.1683,[3]6.4254,[4]6.5983,[5]6.8415,[6]6.7832,[7]6.7378,[8]6.6388,[9]6.6630,[10]6.6122,[11]6.6512,[12]6.6358,[13]6.7218,[14]6.7256,[15]6.7242,[16]6.7111,
|
| 158 |
+
Final estimate: PPL = 6.7111 +/- 0.13698
|
| 159 |
+
|
| 160 |
+
llama_perf_context_print: load time = 1195.14 ms
|
| 161 |
+
llama_perf_context_print: prompt eval time = 24892.95 ms / 32768 tokens ( 0.76 ms per token, 1316.36 tokens per second)
|
| 162 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 163 |
+
llama_perf_context_print: total time = 25349.44 ms / 32769 tokens
|
| 164 |
+
llama_perf_context_print: graphs reused = 0
|
| 165 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 166 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 17789 + (3048 = 1925 + 80 + 1043) + 3268 |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20869 + (2079 = 1925 + 80 + 74) + 1175 |
|
| 168 |
+
llama_memory_breakdown_print: | - Host | 7809 = 7672 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_code.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_general.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/Qwen3-4B-Instruct-2507-F16/ppl_corpus_math.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
| model | size | params | backend | ngl | test | t/s |
|
| 7 |
+
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
|
| 8 |
+
| qwen3 4B MXFP4 MoE | 4.65 GiB | 4.02 B | CUDA | 35 | pp8 | 282.93 ± 8.32 |
|
| 9 |
+
| qwen3 4B MXFP4 MoE | 4.65 GiB | 4.02 B | CUDA | 35 | tg128 | 41.98 ± 0.51 |
|
| 10 |
+
|
| 11 |
+
build: 92bb442ad (7040)
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_code.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20945 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 37 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 4.65 GiB (9.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 2528.46 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 1116.61 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 1116.61 MiB
|
| 121 |
+
......................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 111.324 ms
|
| 156 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.38 seconds per pass - ETA 1.00 minutes
|
| 158 |
+
[1]3.1404,[2]2.4650,[3]1.8256,[4]1.6819,[5]1.7997,[6]1.8517,[7]1.8071,[8]1.7788,[9]1.6968,[10]1.6414,[11]1.6084,[12]1.6099,[13]1.5761,[14]1.5537,[15]1.5755,[16]1.5536,[17]1.5409,[18]1.5477,[19]1.5336,[20]1.5141,[21]1.5063,[22]1.5027,[23]1.5241,[24]1.5111,[25]1.5166,[26]1.4992,[27]1.4899,[28]1.4883,[29]1.5039,[30]1.5074,[31]1.4975,[32]1.4868,[33]1.4895,[34]1.4870,[35]1.4863,[36]1.5133,[37]1.5236,[38]1.5293,[39]1.5366,[40]1.5378,[41]1.5313,[42]1.5456,[43]1.5471,[44]1.5480,
|
| 159 |
+
Final estimate: PPL = 1.5480 +/- 0.01225
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 900.08 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 49668.17 ms / 90112 tokens ( 0.55 ms per token, 1814.28 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 50897.77 ms / 90113 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18495 + (2240 = 1116 + 80 + 1043) + 3370 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21677 + (1270 = 1116 + 80 + 74) + 1175 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 2665 = 2528 + 128 + 9 |
|
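The llama_memory_breakdown_print rows follow the header order total = free + self + unaccounted, with self = model + context + compute, all rounded to whole MiB. A small sanity check of the CUDA0 row above using the unrounded buffer sizes reported earlier in the same log (this is just arithmetic, not something the tool exposes):

```python
# Buffers reported earlier in this log: model 1116.61 MiB, KV cache 80.00 MiB,
# compute 1043.62 MiB; the breakdown prints their rounded sum as "2240".
model_mib, kv_mib, compute_mib = 1116.61, 80.00, 1043.62

self_mib = model_mib + kv_mib + compute_mib
print(f"self = {self_mib:.2f} MiB (printed as {round(self_mib)} in the breakdown)")
```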
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_general.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20939 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 37 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 4.65 GiB (9.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 2528.46 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 1116.61 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 1116.61 MiB
|
| 121 |
+
......................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 48.146 ms
|
| 156 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.39 seconds per pass - ETA 0.33 minutes
|
| 158 |
+
[1]8.2598,[2]10.3805,[3]10.7340,[4]10.3999,[5]10.1156,[6]8.6338,[7]7.7530,[8]7.7359,[9]8.1547,[10]8.2916,[11]8.3127,[12]8.6472,[13]8.6833,[14]8.8168,[15]8.8889,
|
| 159 |
+
Final estimate: PPL = 8.8889 +/- 0.20579
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 858.97 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 17152.56 ms / 30720 tokens ( 0.56 ms per token, 1790.99 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 17578.64 ms / 30721 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18509 + (2240 = 1116 + 80 + 1043) + 3356 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21677 + (1270 = 1116 + 80 + 74) + 1175 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 2665 = 2528 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/perplexity_math.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20931 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type f16: 37 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 4.65 GiB (9.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 2528.46 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 1116.61 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 1116.61 MiB
|
| 121 |
+
......................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 1043.62 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 50.843 ms
|
| 156 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.36 seconds per pass - ETA 0.35 minutes
|
| 158 |
+
[1]5.5426,[2]6.1761,[3]6.4316,[4]6.6081,[5]6.8508,[6]6.7906,[7]6.7447,[8]6.6441,[9]6.6703,[10]6.6187,[11]6.6579,[12]6.6434,[13]6.7290,[14]6.7311,[15]6.7287,[16]6.7153,
|
| 159 |
+
Final estimate: PPL = 6.7153 +/- 0.13704
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 863.83 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 18006.78 ms / 32768 tokens ( 0.55 ms per token, 1819.76 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 18463.17 ms / 32769 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 18495 + (2240 = 1116 + 80 + 1043) + 3370 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21677 + (1270 = 1116 + 80 + 74) + 1175 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 2665 = 2528 + 128 + 9 |
|
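Taken together, the three runs above give the MXFP4_MOE-F16 hybrid PPL = 1.5480 (code), 8.8889 (general) and 6.7153 (math); the math figure sits within noise of the plain F16 run earlier in this commit (6.7111 +/- 0.13698). A tiny, purely illustrative helper for tabulating such deltas:

```python
# Final PPL values copied from the three logs above; the F16 math baseline is
# from Benchmarks/Qwen3-4B-Instruct-2507-F16/perplexity_math.txt.
mxfp4_moe_f16 = {"code": 1.5480, "general": 8.8889, "math": 6.7153}
f16_math_baseline = 6.7111

delta = mxfp4_moe_f16["math"] - f16_math_baseline
print(f"math PPL delta vs F16: {delta:+.4f} ({delta / f16_math_baseline:+.3%})")
for corpus, ppl in mxfp4_moe_f16.items():
    print(f"{corpus:8s} {ppl:.4f}")
```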
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE | 3.62 GiB | 4.02 B | CUDA | 35 | pp8 | 455.17 ± 16.16 |
| qwen3 4B MXFP4 MoE | 3.62 GiB | 4.02 B | CUDA | 35 | tg128 | 83.56 ± 1.10 |

build: 92bb442ad (7040)
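Relative to the MXFP4_MOE-F16 file benchmarked above, this Q4_K-embedding variant is both smaller (3.62 vs 4.65 GiB) and noticeably faster on the same 2x RTX 3090 box: roughly 1.6x on pp8 and 2x on tg128. A quick ratio check over the two llama-bench tables:

```python
# Throughput (t/s) copied from the two llama-bench tables in this commit.
mxfp4_f16_embd = {"pp8": 282.93, "tg128": 41.98}
mxfp4_q4k_embd = {"pp8": 455.17, "tg128": 83.56}

for test in ("pp8", "tg128"):
    ratio = mxfp4_q4k_embd[test] / mxfp4_f16_embd[test]
    print(f"{test}: {ratio:.2f}x")
```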
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21056 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q4_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.62 GiB (7.74 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1765.24 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 972.86 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 972.86 MiB
|
| 121 |
+
................................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 510.40 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 117.845 ms
|
| 156 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.20 seconds per pass - ETA 0.87 minutes
|
| 158 |
+
[1]3.1495,[2]2.4863,[3]1.8363,[4]1.6911,[5]1.8121,[6]1.8652,[7]1.8212,[8]1.7925,[9]1.7083,[10]1.6524,[11]1.6186,[12]1.6217,[13]1.5869,[14]1.5634,[15]1.5849,[16]1.5629,[17]1.5500,[18]1.5571,[19]1.5425,[20]1.5224,[21]1.5146,[22]1.5108,[23]1.5332,[24]1.5196,[25]1.5257,[26]1.5081,[27]1.4988,[28]1.4973,[29]1.5127,[30]1.5167,[31]1.5064,[32]1.4955,[33]1.4980,[34]1.4954,[35]1.4945,[36]1.5223,[37]1.5325,[38]1.5382,[39]1.5455,[40]1.5470,[41]1.5404,[42]1.5550,[43]1.5565,[44]1.5573,
|
| 159 |
+
Final estimate: PPL = 1.5573 +/- 0.01241
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 767.54 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 43046.67 ms / 90112 tokens ( 0.48 ms per token, 2093.36 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 44253.66 ms / 90113 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19277 + (1563 = 972 + 80 + 510) + 3266 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21833 + (1126 = 972 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1902 = 1765 + 128 + 9 |
|
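With several of these perplexity logs accumulating under Benchmarks/, the final estimates are easiest to compare by scraping the last line of each file. A small sketch, assuming the directory layout added in this commit:

```python
import re
from pathlib import Path

# Collect the final estimate from every perplexity log under Benchmarks/
# (one subdirectory per quantization variant, as in this repo).
pattern = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

for log in sorted(Path("Benchmarks").glob("*/perplexity_*.txt")):
    match = pattern.search(log.read_text())
    if match:
        print(f"{log.parent.name}  {log.stem}: {match.group(1)} +/- {match.group(2)}")
```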
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21055 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q4_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.62 GiB (7.74 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1765.24 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 972.86 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 972.86 MiB
|
| 121 |
+
................................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 510.40 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 48.796 ms
|
| 156 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.19 seconds per pass - ETA 0.28 minutes
|
| 158 |
+
[1]8.4978,[2]10.6172,[3]11.0539,[4]10.6843,[5]10.3818,[6]8.8341,[7]7.9099,[8]7.9015,[9]8.3533,[10]8.4720,[11]8.4926,[12]8.8281,[13]8.8684,[14]9.0138,[15]9.0787,
|
| 159 |
+
Final estimate: PPL = 9.0787 +/- 0.21163
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 761.00 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 14802.68 ms / 30720 tokens ( 0.48 ms per token, 2075.30 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 15217.86 ms / 30721 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19394 + (1563 = 972 + 80 + 510) + 3149 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21833 + (1126 = 972 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1902 = 1765 + 128 + 9 |
|
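Note: a figure such as "Final estimate: PPL = 9.0787 +/- 0.21163" above is, under the standard definition of perplexity, the exponential of the mean per-token negative log-likelihood over the evaluated corpus. A minimal sketch of that relationship with placeholder numbers (the per-token values are not printed in these logs, and the exact aggregation inside the llama.cpp perplexity tool is not shown here):

import math

# Placeholder per-token negative log-likelihoods (natural log); illustrative only,
# not values taken from the logs above.
nlls = [2.31, 2.05, 2.48, 1.97]

mean_nll = sum(nlls) / len(nlls)
ppl = math.exp(mean_nll)  # perplexity = exp(mean negative log-likelihood)
print(f"PPL = {ppl:.4f}")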
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20939 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q4_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.62 GiB (7.74 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1765.24 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 972.86 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 972.86 MiB
|
| 121 |
+
................................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 510.40 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 44.834 ms
|
| 156 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.19 seconds per pass - ETA 0.30 minutes
|
| 158 |
+
[1]5.6201,[2]6.2112,[3]6.4458,[4]6.6239,[5]6.8619,[6]6.8207,[7]6.7793,[8]6.6813,[9]6.7188,[10]6.6682,[11]6.6976,[12]6.6862,[13]6.7737,[14]6.7758,[15]6.7780,[16]6.7696,
|
| 159 |
+
Final estimate: PPL = 6.7696 +/- 0.13721
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 760.16 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 15737.60 ms / 32768 tokens ( 0.48 ms per token, 2082.15 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 16169.20 ms / 32769 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19277 + (1563 = 972 + 80 + 510) + 3266 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21833 + (1126 = 972 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1902 = 1765 + 128 + 9 |
|
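Since every quantization variant keeps its logs under Benchmarks/<variant>/perplexity_<corpus>.txt, the "Final estimate" lines can be collected for a side-by-side comparison. A minimal sketch (the directory layout matches this commit; the script itself is illustrative and not part of the repository):

import re
from pathlib import Path

# Matches lines like: "Final estimate: PPL = 6.7696 +/- 0.13721"
PATTERN = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def collect(root: str = "Benchmarks") -> dict:
    results = {}
    for log in Path(root).glob("*/perplexity_*.txt"):
        match = PATTERN.search(log.read_text(errors="ignore"))
        if match:
            variant = log.parent.name            # e.g. Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K
            corpus = log.stem.split("_", 1)[1]   # e.g. code, general, math
            results[(variant, corpus)] = (float(match.group(1)), float(match.group(2)))
    return results

if __name__ == "__main__":
    for (variant, corpus), (ppl, err) in sorted(collect().items()):
        print(f"{variant:55} {corpus:8} PPL = {ppl:.4f} +/- {err:.4f}")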
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q4_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/llamabench.txt
ADDED
|
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE             |   3.71 GiB |     4.02 B | CUDA       |  35 |             pp8 |        439.64 ± 7.72 |
| qwen3 4B MXFP4 MoE             |   3.71 GiB |     4.02 B | CUDA       |  35 |           tg128 |         76.14 ± 0.82 |

build: 92bb442ad (7040)
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_code.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21056 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.71 GiB (7.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1831.61 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 985.36 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 985.36 MiB
|
| 121 |
+
...............................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 556.77 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 115.032 ms
|
| 156 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.21 seconds per pass - ETA 0.88 minutes
|
| 158 |
+
[1]3.1166,[2]2.4590,[3]1.8225,[4]1.6805,[5]1.8025,[6]1.8557,[7]1.8119,[8]1.7833,[9]1.7009,[10]1.6449,[11]1.6121,[12]1.6137,[13]1.5795,[14]1.5569,[15]1.5787,[16]1.5573,[17]1.5447,[18]1.5517,[19]1.5371,[20]1.5173,[21]1.5099,[22]1.5061,[23]1.5273,[24]1.5141,[25]1.5198,[26]1.5024,[27]1.4931,[28]1.4915,[29]1.5068,[30]1.5103,[31]1.5002,[32]1.4895,[33]1.4922,[34]1.4896,[35]1.4890,[36]1.5163,[37]1.5265,[38]1.5322,[39]1.5395,[40]1.5408,[41]1.5343,[42]1.5486,[43]1.5502,[44]1.5510,
|
| 159 |
+
Final estimate: PPL = 1.5510 +/- 0.01230
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 771.64 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 43603.18 ms / 90112 tokens ( 0.48 ms per token, 2066.64 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 44801.05 ms / 90113 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19335 + (1622 = 985 + 80 + 556) + 3149 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21821 + (1139 = 985 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1968 = 1831 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_general.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21056 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.71 GiB (7.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1831.61 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 985.36 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 985.36 MiB
|
| 121 |
+
...............................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 556.77 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 48.503 ms
|
| 156 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.21 seconds per pass - ETA 0.30 minutes
|
| 158 |
+
[1]8.2946,[2]10.4411,[3]10.7667,[4]10.4298,[5]10.1559,[6]8.6677,[7]7.7773,[8]7.7565,[9]8.1755,[10]8.3201,[11]8.3422,[12]8.6817,[13]8.7184,[14]8.8537,[15]8.9263,
|
| 159 |
+
Final estimate: PPL = 8.9263 +/- 0.20660
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 788.56 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 14871.70 ms / 30720 tokens ( 0.48 ms per token, 2065.67 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 15286.23 ms / 30721 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19336 + (1622 = 985 + 80 + 556) + 3148 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21821 + (1139 = 985 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1968 = 1831 + 128 + 9 |
|
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,169 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21055 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = qwen3
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
|
| 14 |
+
llama_model_loader: - kv 3: general.version str = 2507
|
| 15 |
+
llama_model_loader: - kv 4: general.finetune str = Instruct
|
| 16 |
+
llama_model_loader: - kv 5: general.basename str = Qwen3
|
| 17 |
+
llama_model_loader: - kv 6: general.size_label str = 4B
|
| 18 |
+
llama_model_loader: - kv 7: general.license str = apache-2.0
|
| 19 |
+
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
|
| 20 |
+
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
|
| 21 |
+
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
|
| 22 |
+
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
|
| 23 |
+
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
|
| 24 |
+
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
|
| 25 |
+
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
|
| 26 |
+
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
|
| 27 |
+
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
|
| 28 |
+
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 29 |
+
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
|
| 30 |
+
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
|
| 31 |
+
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
|
| 32 |
+
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
|
| 41 |
+
llama_model_loader: - kv 30: general.quantization_version u32 = 2
|
| 42 |
+
llama_model_loader: - kv 31: general.file_type u32 = 38
|
| 43 |
+
llama_model_loader: - type f32: 145 tensors
|
| 44 |
+
llama_model_loader: - type q8_0: 216 tensors
|
| 45 |
+
llama_model_loader: - type q5_K: 37 tensors
|
| 46 |
+
print_info: file format = GGUF V3 (latest)
|
| 47 |
+
print_info: file type = MXFP4 MoE
|
| 48 |
+
print_info: file size = 3.71 GiB (7.93 BPW)
|
| 49 |
+
load: printing all EOG tokens:
|
| 50 |
+
load: - 151643 ('<|endoftext|>')
|
| 51 |
+
load: - 151645 ('<|im_end|>')
|
| 52 |
+
load: - 151662 ('<|fim_pad|>')
|
| 53 |
+
load: - 151663 ('<|repo_name|>')
|
| 54 |
+
load: - 151664 ('<|file_sep|>')
|
| 55 |
+
load: special tokens cache size = 26
|
| 56 |
+
load: token to piece cache size = 0.9311 MB
|
| 57 |
+
print_info: arch = qwen3
|
| 58 |
+
print_info: vocab_only = 0
|
| 59 |
+
print_info: n_ctx_train = 262144
|
| 60 |
+
print_info: n_embd = 2560
|
| 61 |
+
print_info: n_embd_inp = 2560
|
| 62 |
+
print_info: n_layer = 36
|
| 63 |
+
print_info: n_head = 32
|
| 64 |
+
print_info: n_head_kv = 8
|
| 65 |
+
print_info: n_rot = 128
|
| 66 |
+
print_info: n_swa = 0
|
| 67 |
+
print_info: is_swa_any = 0
|
| 68 |
+
print_info: n_embd_head_k = 128
|
| 69 |
+
print_info: n_embd_head_v = 128
|
| 70 |
+
print_info: n_gqa = 4
|
| 71 |
+
print_info: n_embd_k_gqa = 1024
|
| 72 |
+
print_info: n_embd_v_gqa = 1024
|
| 73 |
+
print_info: f_norm_eps = 0.0e+00
|
| 74 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 75 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 76 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 77 |
+
print_info: f_logit_scale = 0.0e+00
|
| 78 |
+
print_info: f_attn_scale = 0.0e+00
|
| 79 |
+
print_info: n_ff = 9728
|
| 80 |
+
print_info: n_expert = 0
|
| 81 |
+
print_info: n_expert_used = 0
|
| 82 |
+
print_info: n_expert_groups = 0
|
| 83 |
+
print_info: n_group_used = 0
|
| 84 |
+
print_info: causal attn = 1
|
| 85 |
+
print_info: pooling type = -1
|
| 86 |
+
print_info: rope type = 2
|
| 87 |
+
print_info: rope scaling = linear
|
| 88 |
+
print_info: freq_base_train = 5000000.0
|
| 89 |
+
print_info: freq_scale_train = 1
|
| 90 |
+
print_info: n_ctx_orig_yarn = 262144
|
| 91 |
+
print_info: rope_finetuned = unknown
|
| 92 |
+
print_info: model type = 4B
|
| 93 |
+
print_info: model params = 4.02 B
|
| 94 |
+
print_info: general.name = Qwen3 4B Instruct 2507
|
| 95 |
+
print_info: vocab type = BPE
|
| 96 |
+
print_info: n_vocab = 151936
|
| 97 |
+
print_info: n_merges = 151387
|
| 98 |
+
print_info: BOS token = 151643 '<|endoftext|>'
|
| 99 |
+
print_info: EOS token = 151645 '<|im_end|>'
|
| 100 |
+
print_info: EOT token = 151645 '<|im_end|>'
|
| 101 |
+
print_info: PAD token = 151643 '<|endoftext|>'
|
| 102 |
+
print_info: LF token = 198 'Ċ'
|
| 103 |
+
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
|
| 104 |
+
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
|
| 105 |
+
print_info: FIM MID token = 151660 '<|fim_middle|>'
|
| 106 |
+
print_info: FIM PAD token = 151662 '<|fim_pad|>'
|
| 107 |
+
print_info: FIM REP token = 151663 '<|repo_name|>'
|
| 108 |
+
print_info: FIM SEP token = 151664 '<|file_sep|>'
|
| 109 |
+
print_info: EOG token = 151643 '<|endoftext|>'
|
| 110 |
+
print_info: EOG token = 151645 '<|im_end|>'
|
| 111 |
+
print_info: EOG token = 151662 '<|fim_pad|>'
|
| 112 |
+
print_info: EOG token = 151663 '<|repo_name|>'
|
| 113 |
+
print_info: EOG token = 151664 '<|file_sep|>'
|
| 114 |
+
print_info: max token length = 256
|
| 115 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 116 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 117 |
+
load_tensors: offloaded 20/37 layers to GPU
|
| 118 |
+
load_tensors: CPU_Mapped model buffer size = 1831.61 MiB
|
| 119 |
+
load_tensors: CUDA0 model buffer size = 985.36 MiB
|
| 120 |
+
load_tensors: CUDA1 model buffer size = 985.36 MiB
|
| 121 |
+
...............................................................................................
|
| 122 |
+
llama_context: constructing llama_context
|
| 123 |
+
llama_context: n_seq_max = 1
|
| 124 |
+
llama_context: n_ctx = 2048
|
| 125 |
+
llama_context: n_ctx_seq = 2048
|
| 126 |
+
llama_context: n_batch = 2048
|
| 127 |
+
llama_context: n_ubatch = 512
|
| 128 |
+
llama_context: causal_attn = 1
|
| 129 |
+
llama_context: flash_attn = auto
|
| 130 |
+
llama_context: kv_unified = false
|
| 131 |
+
llama_context: freq_base = 5000000.0
|
| 132 |
+
llama_context: freq_scale = 1
|
| 133 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
|
| 134 |
+
llama_context: CPU output buffer size = 0.58 MiB
|
| 135 |
+
llama_kv_cache: CPU KV buffer size = 128.00 MiB
|
| 136 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 137 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 138 |
+
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
|
| 139 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 140 |
+
llama_context: CUDA0 compute buffer size = 556.77 MiB
|
| 141 |
+
llama_context: CUDA1 compute buffer size = 74.01 MiB
|
| 142 |
+
llama_context: CUDA_Host compute buffer size = 9.01 MiB
|
| 143 |
+
llama_context: graph nodes = 1267
|
| 144 |
+
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
|
| 145 |
+
common_init_from_params: added <|endoftext|> logit bias = -inf
|
| 146 |
+
common_init_from_params: added <|im_end|> logit bias = -inf
|
| 147 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 148 |
+
common_init_from_params: added <|repo_name|> logit bias = -inf
|
| 149 |
+
common_init_from_params: added <|file_sep|> logit bias = -inf
|
| 150 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 151 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 152 |
+
|
| 153 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 154 |
+
perplexity: tokenizing the input ..
|
| 155 |
+
perplexity: tokenization took 47.795 ms
|
| 156 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 157 |
+
perplexity: 1.20 seconds per pass - ETA 0.32 minutes
|
| 158 |
+
[1]5.5364,[2]6.1820,[3]6.4449,[4]6.6243,[5]6.8718,[6]6.8148,[7]6.7697,[8]6.6644,[9]6.6872,[10]6.6394,[11]6.6787,[12]6.6669,[13]6.7538,[14]6.7566,[15]6.7548,[16]6.7425,
|
| 159 |
+
Final estimate: PPL = 6.7425 +/- 0.13778
|
| 160 |
+
|
| 161 |
+
llama_perf_context_print: load time = 773.14 ms
|
| 162 |
+
llama_perf_context_print: prompt eval time = 15909.61 ms / 32768 tokens ( 0.49 ms per token, 2059.64 tokens per second)
|
| 163 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 164 |
+
llama_perf_context_print: total time = 16343.01 ms / 32769 tokens
|
| 165 |
+
llama_perf_context_print: graphs reused = 0
|
| 166 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 167 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19335 + (1622 = 985 + 80 + 556) + 3149 |
|
| 168 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21821 + (1139 = 985 + 80 + 74) + 1163 |
|
| 169 |
+
llama_memory_breakdown_print: | - Host | 1968 = 1831 + 128 + 9 |
|
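With math-corpus estimates now available for both hybrids (Q4_K: 6.7696 +/- 0.13721 above, Q5_K: 6.7425 +/- 0.13778 here), a quick sanity check is to compare the gap against the combined standard error. A rough sketch; it treats the two estimates as independent, which overstates the uncertainty since both runs use the same corpus:

import math

# Final estimates copied from the perplexity_math logs in this commit.
q4_k = (6.7696, 0.13721)  # (PPL, standard error)
q5_k = (6.7425, 0.13778)

delta = q4_k[0] - q5_k[0]
combined_err = math.sqrt(q4_k[1] ** 2 + q5_k[1] ** 2)
print(f"delta PPL = {delta:.4f}, combined +/- = {combined_err:.4f}")
# The difference (~0.03) is far below the combined error (~0.19), so the two
# quantizations are statistically indistinguishable on this corpus.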
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q5_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B MXFP4 MoE | 3.81 GiB | 4.02 B | CUDA | 35 | pp8 | 412.52 ± 4.59 |
| qwen3 4B MXFP4 MoE | 3.81 GiB | 4.02 B | CUDA | 35 | tg128 | 70.36 ± 1.14 |

build: 92bb442ad (7040)
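The table above is standard llama-bench output. A plausible invocation matching the pp8 and tg128 tests at ngl 35 would be the sketch below; the exact command line is not recorded in the log, and the model path is the one shown in the perplexity logs that follow, so treat both as assumptions:

llama-bench -m /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K.gguf -ngl 35 -p 8 -n 128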
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,169 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20952 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 38
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q8_0: 216 tensors
llama_model_loader: - type q6_K: 37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 3.81 GiB (8.13 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1902.12 MiB
load_tensors: CUDA0 model buffer size = 998.64 MiB
load_tensors: CUDA1 model buffer size = 998.64 MiB
..............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 606.03 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 112.131 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 1.22 seconds per pass - ETA 0.88 minutes
[1]3.1467,[2]2.4678,[3]1.8270,[4]1.6835,[5]1.8011,[6]1.8538,[7]1.8089,[8]1.7804,[9]1.6979,[10]1.6423,[11]1.6091,[12]1.6108,[13]1.5769,[14]1.5546,[15]1.5762,[16]1.5543,[17]1.5416,[18]1.5486,[19]1.5342,[20]1.5144,[21]1.5066,[22]1.5029,[23]1.5243,[24]1.5116,[25]1.5172,[26]1.4998,[27]1.4905,[28]1.4888,[29]1.5044,[30]1.5080,[31]1.4980,[32]1.4874,[33]1.4902,[34]1.4876,[35]1.4868,[36]1.5139,[37]1.5240,[38]1.5296,[39]1.5369,[40]1.5381,[41]1.5316,[42]1.5459,[43]1.5474,[44]1.5484,
Final estimate: PPL = 1.5484 +/- 0.01225

llama_perf_context_print: load time = 800.23 ms
llama_perf_context_print: prompt eval time = 44108.19 ms / 90112 tokens ( 0.49 ms per token, 2042.98 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 45289.03 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19167 + (1684 = 998 + 80 + 606) + 3254 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21807 + (1152 = 998 + 80 + 74) + 1163 |
llama_memory_breakdown_print: | - Host | 2039 = 1902 + 128 + 9 |
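A log like the one above is produced by llama.cpp's llama-perplexity tool. A plausible invocation consistent with the n_ctx = 2048 and 20 offloaded layers shown is sketched below; the exact command line is not recorded in the log, and the corpus filename is assumed to be the ppl_corpus_code.txt file added alongside it in this commit:

llama-perplexity -m /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K.gguf -f ppl_corpus_code.txt -c 2048 -ngl 20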
Benchmarks/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,169 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20952 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /mnt/world7/AI/Models/GGUF/Qwen3-4B-Instruct-2507/Qwen3-4B-Instruct-2507-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 4B Instruct 2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen3
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 9: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 10: qwen3.block_count u32 = 36
llama_model_loader: - kv 11: qwen3.context_length u32 = 262144
llama_model_loader: - kv 12: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 13: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 14: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 15: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 16: qwen3.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 17: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 19: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 38
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q8_0: 216 tensors
llama_model_loader: - type q6_K: 37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 3.81 GiB (8.13 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3 4B Instruct 2507
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1902.12 MiB
load_tensors: CUDA0 model buffer size = 998.64 MiB
load_tensors: CUDA1 model buffer size = 998.64 MiB
..............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 128.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 288.00 MiB ( 2048 cells, 36 layers, 1/1 seqs), K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 606.03 MiB
llama_context: CUDA1 compute buffer size = 74.01 MiB
llama_context: CUDA_Host compute buffer size = 9.01 MiB
llama_context: graph nodes = 1267
llama_context: graph splits = 213 (with bs=512), 52 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 49 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 1.22 seconds per pass - ETA 0.30 minutes
[1]8.2417,[2]10.3328,[3]10.7044,[4]10.3741,[5]10.0981,[6]8.6204,[7]7.7392,[8]7.7231,[9]8.1445,[10]8.2802,[11]8.3005,[12]8.6357,[13]8.6745,[14]8.8101,[15]8.8802,
Final estimate: PPL = 8.8802 +/- 0.20524

llama_perf_context_print: load time = 800.00 ms
llama_perf_context_print: prompt eval time = 15189.70 ms / 30720 tokens ( 0.49 ms per token, 2022.42 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 15602.35 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 19168 + (1684 = 998 + 80 + 606) + 3254 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 21807 + (1152 = 998 + 80 + 74) + 1163 |
llama_memory_breakdown_print: | - Host | 2039 = 1902 + 128 + 9 |