Improve model card: Add pipeline tag, library name, update license, and correct paper link
#1 opened by nielsr (HF Staff)
README.md CHANGED
|
@@ -1,33 +1,41 @@
|
|
| 1 |
---
|
| 2 |
-
license:
|
| 3 |
tags:
|
| 4 |
- decompile
|
| 5 |
- binary
|
| 6 |
widget:
|
| 7 |
-
|
| 8 |
---
|
| 9 |
|
|
|
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
|
| 16 |
|
|
|
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
| Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
|
| 21 |
-
|
| 22 |
-
| Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG
|
| 23 |
-
| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353
|
| 24 |
-
| Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620
|
| 25 |
-
| +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572
|
| 26 |
-
| +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350
|
| 27 |
-
| +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382
|
| 28 |
| +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |
|
| 29 |
|
| 30 |
-
|
| 31 |
Here is an example of how to use our model (V2 only; for previous models, see the corresponding model page on Hugging Face).
|
| 32 |
|
| 33 |
1. Install Ghidra
|
|
@@ -95,7 +103,8 @@ with tempfile.TemporaryDirectory() as temp_dir:
|
|
| 95 |
c_decompile = f.read()
|
| 96 |
c_func = []
|
| 97 |
flag = 0
|
| 98 |
-
for line in c_decompile.split('\n'):
|
| 99 |
if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
|
| 100 |
flag = 1
|
| 101 |
c_func.append(line)
|
|
@@ -111,10 +120,14 @@ with tempfile.TemporaryDirectory() as temp_dir:
|
|
| 111 |
if 'func0' in c_func[idx_tmp]:
|
| 112 |
break
|
| 113 |
c_func = c_func[idx_tmp:]
|
| 114 |
-
input_asm = '\n'.join(c_func).strip()
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
input_asm_prompt = before+input_asm.strip()+after
|
| 119 |
with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
|
| 120 |
f.write(input_asm_prompt)
|
|
@@ -165,14 +178,42 @@ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
|
|
| 165 |
with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
|
| 166 |
func = f.read()
|
| 167 |
|
| 168 |
-
print(f'pseudo function
|
| 169 |
-
|
| 170 |
|
| 171 |
```
|
| 172 |
|
| 173 |
### 4. License
|
| 174 |
-
This code repository is licensed under the MIT License.
|
| 175 |
|
| 176 |
### 5. Contact
|
| 177 |
|
| 178 |
If you have any questions, please raise an issue.
|
| 1 |
---
|
| 2 |
+
license: other
|
| 3 |
tags:
|
| 4 |
- decompile
|
| 5 |
- binary
|
| 6 |
widget:
|
| 7 |
+
- text: "# This is the assembly code:\nfloat func0(float param_1)\n\n{\n return param_1\
|
| 8 |
+
\ - (float)(int)param_1;\n}# What is the source code?\n"
|
| 9 |
+
pipeline_tag: text-generation
|
| 10 |
+
library_name: transformers
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# LLM4Binary/llm4decompile-6.7b-v2
|
| 14 |
|
| 15 |
+
This repository hosts the **LLM4Binary/llm4decompile-6.7b-v2** model, part of the LLM4Decompile series. The model was presented in the paper [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).
|
| 16 |
|
| 17 |
+
**GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
|
| 18 |
|
| 19 |
+
## About LLM4Decompile
|
| 20 |
|
| 21 |
+
LLM4Decompile is an open-source large language model dedicated to binary decompilation. The current version decompiles Linux x86_64 binaries, compiled at GCC optimization levels O0 through O3, into human-readable C source code. The V2 series models, including `llm4decompile-6.7b-v2`, are designed to **refine** the pseudo-code produced by tools like Ghidra; they were trained on 2 billion tokens with a maximum context length of 4,096.
|
| 22 |
|
| 23 |
+
- **LLM4Decompile-End** focuses on decompiling the binary directly.
|
| 24 |
+
- **LLM4Decompile-Ref** (like this `v2` model) refines the pseudo-code decompiled by Ghidra.
|
| 25 |
+
|
| 26 |
+
## Evaluation Results
|
| 27 |
|
| 28 |
| Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
|
| 29 |
+
|:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
|
| 30 |
+
| Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
|
| 31 |
+
| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
|
| 32 |
+
| Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
|
| 33 |
+
| +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
|
| 34 |
+
| +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
|
| 35 |
+
| +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
|
| 36 |
| +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |
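As a sanity check on the table above, the AVG columns appear to be the arithmetic mean of the four optimization levels. The sketch below recomputes the re-executability averages for two rows. The dictionary name and layout are illustrative and not part of the original evaluation code; the values are copied from the table.

```python
# Re-executability rates copied from the table above, ordered O0, O1, O2, O3.
# The dictionary layout is an illustrative assumption.
reexec = {
    "Ghidra": [0.3476, 0.1646, 0.1524, 0.1402],
    "+LLM4Decompile-Ref-6.7B": [0.7439, 0.4695, 0.4756, 0.4207],
}


def average(rates):
    """Arithmetic mean rounded to four decimals, matching the table's precision."""
    return round(sum(rates) / len(rates), 4)


averages = {name: average(rates) for name, rates in reexec.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Both recomputed values agree with the AVG column (0.2012 and 0.5274).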
|
| 37 |
|
| 38 |
+
## How to Use
|
| 39 |
Here is an example of how to use our model (V2 only; for previous models, see the corresponding model page on Hugging Face).
|
| 40 |
|
| 41 |
1. Install Ghidra
|
|
|
|
| 103 |
c_decompile = f.read()
|
| 104 |
c_func = []
|
| 105 |
flag = 0
|
| 106 |
+
for line in c_decompile.split('\n'):
|
| 108 |
if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
|
| 109 |
flag = 1
|
| 110 |
c_func.append(line)
|
|
|
|
| 120 |
if 'func0' in c_func[idx_tmp]:
|
| 121 |
break
|
| 122 |
c_func = c_func[idx_tmp:]
|
| 123 |
+
input_asm = '\n'.join(c_func).strip()
|
| 125 |
+
|
| 126 |
+
before = f"# This is the assembly code:\n"#prompt
|
| 128 |
+
after = "\n# What is the source code?\n"#prompt
|
| 131 |
input_asm_prompt = before+input_asm.strip()+after
|
| 132 |
with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
|
| 133 |
f.write(input_asm_prompt)
|
|
|
|
| 178 |
with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
|
| 179 |
func = f.read()
|
| 180 |
|
| 181 |
+
print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
|
| 183 |
+
print(f'refined function:\n{c_func_decompile}')
|
| 185 |
|
| 186 |
```
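The function-extraction and prompt-building steps in the script above can be sketched as two small helpers. This is a minimal sketch rather than the repository's code: `extract_function` and `build_prompt` are hypothetical names, and the `// Function: <name>` header layout is an assumption based on the `"Function: func0"` check in the script. The prompt prefix and suffix are the `before` and `after` strings from the script.

```python
BEFORE = "# This is the assembly code:\n"  # prompt prefix, as in the script above
AFTER = "\n# What is the source code?\n"   # prompt suffix, as in the script above


def extract_function(dump: str, name: str = "func0") -> str:
    """Return the pseudo-code of one function from a decompiler dump.

    Assumes every function is preceded by a header line that contains
    "Function: <name>" (hypothetical layout, see the lead-in).
    """
    collected = []
    inside = False
    for line in dump.split("\n"):
        if f"Function: {name}" in line:
            inside = True  # start collecting after the matching header
            continue
        if inside:
            if "Function:" in line:
                break  # header of the next function: stop collecting
            if line.strip():
                collected.append(line)
    # Skip leading comment lines until the signature line that mentions the name.
    for idx, line in enumerate(collected):
        if name in line:
            return "\n".join(collected[idx:])
    raise ValueError(f"function {name!r} not found in dump")


def build_prompt(pseudo_code: str) -> str:
    """Wrap pseudo-code in the prompt format the model expects."""
    return BEFORE + pseudo_code.strip() + AFTER


if __name__ == "__main__":
    demo = (
        "// Function: func0\n"
        "// entry 00101139\n"
        "float func0(float param_1)\n"
        "{\n"
        "  return param_1 - (float)(int)param_1;\n"
        "}\n"
        "\n"
        "// Function: main\n"
        "int main(void)\n"
        "{\n"
        "  return 0;\n"
        "}\n"
    )
    print(build_prompt(extract_function(demo)))
```

Keeping extraction separate from prompt construction makes it easier to swap in another function name, or another decompiler whose output uses a different header format.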
|
| 187 |
|
| 188 |
### 4. License
|
| 189 |
+
This code repository is licensed under the MIT and DeepSeek licenses.
|
| 190 |
|
| 191 |
### 5. Contact
|
| 192 |
|
| 193 |
If you have any questions, please raise an issue.
|
| 194 |
+
|
| 195 |
+
## Citation
|
| 196 |
+
If you find this work useful, please consider citing the following papers:
|
| 197 |
+
|
| 198 |
+
```bibtex
|
| 199 |
+
@misc{tan2025decompilebench,
|
| 200 |
+
title={Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation},
|
| 201 |
+
author={Hanzhuo Tan and Xiaolong Tian and Hanrui Qi and Jiaming Liu and Zuchen Gao and Siyi Wang and Qi Luo and Jing Li and Yuqun Zhang},
|
| 202 |
+
year={2025},
|
| 203 |
+
eprint={2505.12668},
|
| 204 |
+
archivePrefix={arXiv},
|
| 205 |
+
primaryClass={cs.PL},
|
| 206 |
+
url={https://arxiv.org/abs/2505.12668},
|
| 207 |
+
}
|
| 208 |
+
```
|
| 209 |
+
|
| 210 |
+
```bibtex
|
| 211 |
+
@misc{tan2024llm4decompile,
|
| 212 |
+
title={LLM4Decompile: Decompiling Binary Code with Large Language Models},
|
| 213 |
+
author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
|
| 214 |
+
year={2024},
|
| 215 |
+
eprint={2403.05286},
|
| 216 |
+
archivePrefix={arXiv},
|
| 217 |
+
primaryClass={cs.PL}
|
| 218 |
+
}
|
| 219 |
+
```
|