Improve model card: Add pipeline tag, library name, update license, and correct paper link
This PR enhances the model card for `LLM4Binary/llm4decompile-6.7b-v2` by:
* Adding `pipeline_tag: text-generation` to improve discoverability on the Hugging Face Hub for code generation tasks.
* Adding `library_name: transformers` to enable the automated "How to use" widget, as the model is run with the `transformers` library for inference (see the short inference sketch after this list).
* Updating the metadata `license` to `other` and the content's license section to "MIT and DeepSeek License" to accurately reflect the dual licensing mentioned in the GitHub repository.
* Ensuring the primary paper link, "[Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668)", is prominently displayed at the top.
* Integrating a more detailed "About" section from the GitHub README to provide better context for the model.
* Adding a comprehensive "Citation" section for both the Decompile-Bench and LLM4Decompile papers.
The existing detailed "How to Use" section, covering Ghidra setup and model inference, is preserved because it directly reflects the official usage instructions. All code snippets and the literal newline characters (`\n`) in the widget text have been kept intact.
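For context, the inference flow that the `library_name: transformers` metadata surfaces in the Hub widget, and that the preserved "How to Use" section walks through in full (including the Ghidra preprocessing), looks roughly like the sketch below. This is a minimal sketch: the `.pseudo` file name and the `max_new_tokens` value are illustrative, while the decoding line mirrors the one already in the card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "LLM4Binary/llm4decompile-6.7b-v2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

# Illustrative path: a prompt file built from Ghidra pseudo-code wrapped in the
# card's prompt strings, as produced by the "How to Use" preprocessing step.
with open("sample_O0.pseudo", "r") as f:
    prompt = f.read()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)

# Drop the prompt tokens and the trailing EOS token, keeping only the refined C function.
refined_func = tokenizer.decode(outputs[0][len(inputs[0]):-1])
print(refined_func)
```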
Please review and merge this PR.
@@ -1,33 +1,41 @@
---
-license:
tags:
- decompile
- binary
widget:
-
---

-

-

-

-

| Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
-
-| Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG
-| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353
-| Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620
-| +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572
-| +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350
-| +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382
| +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |

-
Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

1. Install Ghidra

@@ -95,7 +103,8 @@ with tempfile.TemporaryDirectory() as temp_dir:
            c_decompile = f.read()
        c_func = []
        flag = 0
-        for line in c_decompile.split('\n'):
            if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)

@@ -111,10 +120,14 @@ with tempfile.TemporaryDirectory() as temp_dir:
            if 'func0' in c_func[idx_tmp]:
                break
        c_func = c_func[idx_tmp:]
-        input_asm = '\n'.join(c_func).strip()
-
-        before = f"# This is the assembly code:\n"#prompt
-        after = "\n# What is the source code?\n"#prompt
        input_asm_prompt = before+input_asm.strip()+after
        with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
            f.write(input_asm_prompt)

@@ -165,14 +178,42 @@ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
    func = f.read()

-print(f'pseudo function\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
-print(f'refined function\n{c_func_decompile}')

```

### 4. License
-This code repository is licensed under the MIT License.

### 5. Contact

If you have any questions, please raise an issue.
@@ -1,33 +1,41 @@
---
+license: other
tags:
- decompile
- binary
widget:
+- text: "# This is the assembly code:\nfloat func0(float param_1)\n\n{\n return param_1\
+    \ - (float)(int)param_1;\n}# What is the source code?\n"
+pipeline_tag: text-generation
+library_name: transformers
---

+# LLM4Binary/llm4decompile-6.7b-v2

+This repository hosts the **LLM4Binary/llm4decompile-6.7b-v2** model, part of the LLM4Decompile series. The model was presented in the paper [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).

+**GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)

+## About LLM4Decompile

+LLM4Decompile is a pioneering open-source large language model dedicated to binary decompilation. The current version decompiles Linux x86_64 binaries, compiled at GCC optimization levels O0 through O3, into human-readable C source code. The V2 series models, including `llm4decompile-6.7b-v2`, are designed to **refine** the pseudo-code produced by tools such as Ghidra; they were trained on 2 billion tokens with a maximum context length of 4,096.

+- **LLM4Decompile-End** focuses on decompiling the binary directly.
+- **LLM4Decompile-Ref** (like this `v2` model) refines the pseudo-code decompiled by Ghidra.
+
+## Evaluation Results

| Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
+|:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
+| Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
+| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
+| Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
+| +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
+| +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
+| +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
| +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |

+## How to Use
Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

1. Install Ghidra

@@ -95,7 +103,8 @@ with tempfile.TemporaryDirectory() as temp_dir:
            c_decompile = f.read()
        c_func = []
        flag = 0
+        for line in c_decompile.split('\n'):
            if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)

@@ -111,10 +120,14 @@ with tempfile.TemporaryDirectory() as temp_dir:
            if 'func0' in c_func[idx_tmp]:
                break
        c_func = c_func[idx_tmp:]
+        input_asm = '\n'.join(c_func).strip()
+
+        before = f"# This is the assembly code:\n"#prompt
+        after = "\n# What is the source code?\n"#prompt
        input_asm_prompt = before+input_asm.strip()+after
        with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
            f.write(input_asm_prompt)

@@ -165,14 +178,42 @@ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
    func = f.read()

+print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
+print(f'refined function:\n{c_func_decompile}')

```

### 4. License
+This code repository is licensed under the MIT and DeepSeek License.

### 5. Contact

If you have any questions, please raise an issue.
+
+## Citation
+If you find this work useful, please consider citing the following papers:
+
+```bibtex
+@misc{tan2025decompilebench,
+      title={Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation},
+      author={Hanzhuo Tan and Xiaolong Tian and Hanrui Qi and Jiaming Liu and Zuchen Gao and Siyi Wang and Qi Luo and Jing Li and Yuqun Zhang},
+      year={2025},
+      eprint={2505.12668},
+      archivePrefix={arXiv},
+      primaryClass={cs.PL},
+      url={https://arxiv.org/abs/2505.12668},
+}
+```
+
+```bibtex
+@misc{tan2024llm4decompile,
+      title={LLM4Decompile: Decompiling Binary Code with Large Language Models},
+      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
+      year={2024},
+      eprint={2403.05286},
+      archivePrefix={arXiv},
+      primaryClass={cs.PL}
+}
+```
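For reference, the widget example added in the YAML front matter above corresponds to a plain `text-generation` pipeline call like the sketch below. This is a minimal, illustrative check: the prompt string mirrors the widget text, while `max_new_tokens`, the dtype, and the device settings are arbitrary choices that are not part of the card.

```python
from transformers import pipeline
import torch

# Illustrative check of the widget example; the prompt mirrors the YAML `widget` text.
generator = pipeline(
    "text-generation",
    model="LLM4Binary/llm4decompile-6.7b-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
prompt = (
    "# This is the assembly code:\n"
    "float func0(float param_1)\n\n{\n  return param_1 - (float)(int)param_1;\n}"
    "# What is the source code?\n"
)
result = generator(prompt, max_new_tokens=512, return_full_text=False)
print(result[0]["generated_text"])
```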