Improve model card: Add pipeline tag, library name, update license, and correct paper link

#1, opened by nielsr (HF Staff)

Files changed (1)
  1. README.md +63 -22

README.md CHANGED
@@ -1,33 +1,41 @@
  ---
- license: mit
  tags:
  - decompile
  - binary
  widget:
- - text: "# This is the assembly code:\nfloat func0(float param_1)\n\n{\n return param_1 - (float)(int)param_1;\n}# What is the source code?\n"
  ---


- ### 1. Introduction of LLM4Decompile

- LLM4Decompile aims to decompile x86 assembly instructions into C. The newly released V2 series are trained with a larger dataset (2B tokens) and a maximum token length of 4,096, with remarkable performance (up to 100% improvement) compared to the previous model.

- - **Github Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)


- ### 2. Evaluation Results

  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
- |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
- | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
- | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
- | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
- | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
- | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
- | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |

- ### 3. How to Use
  Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

  1. Install Ghidra
@@ -95,7 +103,8 @@ with tempfile.TemporaryDirectory() as temp_dir:
              c_decompile = f.read()
          c_func = []
          flag = 0
-         for line in c_decompile.split('\n'):
              if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                  flag = 1
                  c_func.append(line)
@@ -111,10 +120,14 @@ with tempfile.TemporaryDirectory() as temp_dir:
              if 'func0' in c_func[idx_tmp]:
                  break
          c_func = c_func[idx_tmp:]
-         input_asm = '\n'.join(c_func).strip()
-
-         before = f"# This is the assembly code:\n"#prompt
-         after = "\n# What is the source code?\n"#prompt
          input_asm_prompt = before+input_asm.strip()+after
          with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
              f.write(input_asm_prompt)
@@ -165,14 +178,42 @@ c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
  with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
      func = f.read()

- print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
- print(f'refined function:\n{c_func_decompile}')

  ```

  ### 4. License
- This code repository is licensed under the MIT License.

  ### 5. Contact

  If you have any questions, please raise an issue.
 
  ---
+ license: other
  tags:
  - decompile
  - binary
  widget:
+ - text: "# This is the assembly code:\nfloat func0(float param_1)\n\n{\n return param_1\
+     \ - (float)(int)param_1;\n}# What is the source code?\n"
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

+ # LLM4Binary/llm4decompile-6.7b-v2

+ This repository hosts the **LLM4Binary/llm4decompile-6.7b-v2** model, part of the LLM4Decompile series. The model was presented in the paper [Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation](https://huggingface.co/papers/2505.12668).

+ **GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)

+ ## About LLM4Decompile

+ LLM4Decompile is an open-source large language model dedicated to binary decompilation. The current version decompiles Linux x86_64 binaries, compiled at GCC optimization levels O0 through O3, into human-readable C source code. The V2 series models, including `llm4decompile-6.7b-v2`, are designed to **refine** the pseudo-code produced by tools like Ghidra. They were trained on 2 billion tokens with a maximum context length of 4,096 tokens.

+ - **LLM4Decompile-End** decompiles the binary directly.
+ - **LLM4Decompile-Ref** (like this `v2` model) refines the pseudo-code decompiled by Ghidra.
+
+ ## Evaluation Results

  | Metrics | Re-executability Rate | | | | | Edit Similarity | | | | |
+ |:-----------------------:|:---------------------:|:-------:|:-------:|:-------:|:-------:|:---------------:|:-------:|:-------:|:-------:|:-------:|
+ | Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
+ | LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.1557 | 0.1292 | 0.1293 | 0.1269 | 0.1353 |
+ | Ghidra | 0.3476 | 0.1646 | 0.1524 | 0.1402 | 0.2012 | 0.0699 | 0.0613 | 0.0619 | 0.0547 | 0.0620 |
+ | +GPT-4o | 0.4695 | 0.3415 | 0.2866 | 0.3110 | 0.3522 | 0.0660 | 0.0563 | 0.0567 | 0.0499 | 0.0572 |
+ | +LLM4Decompile-Ref-1.3B | 0.6890 | 0.3720 | 0.4085 | 0.3720 | 0.4604 | 0.1517 | 0.1325 | 0.1292 | 0.1267 | 0.1350 |
+ | +LLM4Decompile-Ref-6.7B | 0.7439 | 0.4695 | 0.4756 | 0.4207 | 0.5274 | 0.1559 | 0.1353 | 0.1342 | 0.1273 | 0.1382 |
  | +LLM4Decompile-Ref-33B | 0.7073 | 0.4756 | 0.4390 | 0.4146 | 0.5091 | 0.1540 | 0.1379 | 0.1363 | 0.1307 | 0.1397 |
 
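A note on reading the table: the AVG columns appear to be the plain arithmetic mean of the O0 to O3 values in the same row. The sketch below recomputes that mean for three of the re-executability rows. The `rows` dictionary and the `mean4` helper are illustrative names introduced here, not part of the model card or the LLM4Decompile repository.

```python
# Re-executability rates copied from the table above:
# ([O0, O1, O2, O3], reported AVG).
rows = {
    "LLM4Decompile-End-6.7B": ([0.6805, 0.3951, 0.3671, 0.3720], 0.4537),
    "Ghidra": ([0.3476, 0.1646, 0.1524, 0.1402], 0.2012),
    "+LLM4Decompile-Ref-6.7B": ([0.7439, 0.4695, 0.4756, 0.4207], 0.5274),
}


def mean4(values):
    """Arithmetic mean of the per-optimization-level scores."""
    return sum(values) / len(values)


for name, (scores, reported) in rows.items():
    computed = mean4(scores)
    # Reported values are rounded to four decimals, so compare with a tolerance.
    status = "ok" if abs(computed - reported) < 5e-4 else "mismatch"
    print(f"{name}: computed={computed:.4f} reported={reported:.4f} {status}")
```

The same check can be applied to the Edit Similarity half of the table by swapping in those columns.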
+ ## How to Use
  Here is an example of how to use our model (Only for V2. For previous models, please check the corresponding model page at HF).

  1. Install Ghidra
 
              c_decompile = f.read()
          c_func = []
          flag = 0
+         for line in c_decompile.split('\n'):
              if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                  flag = 1
                  c_func.append(line)
 
              if 'func0' in c_func[idx_tmp]:
                  break
          c_func = c_func[idx_tmp:]
+         input_asm = '\n'.join(c_func).strip()
+
+         before = f"# This is the assembly code:\n"#prompt
+         after = "\n# What is the source code?\n"#prompt
          input_asm_prompt = before+input_asm.strip()+after
          with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
              f.write(input_asm_prompt)
 
  with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
      func = f.read()

+ print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
+ print(f'refined function:\n{c_func_decompile}')

  ```
 
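The extraction-and-prompt logic in the script above can be isolated into a small helper that is easier to test on its own. This is a minimal sketch, not the repository's code: `build_refine_prompt` and the `sample` Ghidra output are hypothetical, and the sketch assumes each function in the exported file is preceded by a `// Function: <name>` comment line, as the script's `"Function: func0"` check suggests.

```python
def build_refine_prompt(ghidra_output: str, func_name: str = "func0") -> str:
    """Extract one function from exported pseudo-code and wrap it in the V2 prompt."""
    c_func = []
    found = False
    for line in ghidra_output.split("\n"):
        if f"Function: {func_name}" in line:
            found = True
            c_func.append(line)
            continue
        if found:
            # Stop at the next function marker (assumed format: "// Function: <name>").
            if "// Function:" in line:
                break
            c_func.append(line)
    if not found:
        raise ValueError(f"function {func_name!r} not found in Ghidra output")

    # Drop the marker and comment lines that precede the function signature.
    for idx, line in enumerate(c_func[1:], start=1):
        if func_name in line:
            c_func = c_func[idx:]
            break

    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    return before + "\n".join(c_func).strip() + after


# Hypothetical export containing two functions; only func0 should be kept.
sample = """// Function: func0
// some Ghidra comment
float func0(float param_1)

{
  return param_1 - (float)(int)param_1;
}

// Function: main
int main(void) { return 0; }
"""

print(build_refine_prompt(sample))
```

Keeping the escape sequences (`\n`) inside the string literals matters here: a raw line break inside a single-quoted or double-quoted Python string is a syntax error.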
  ### 4. License
+ This code repository is licensed under the MIT License and the DeepSeek License.

  ### 5. Contact

  If you have any questions, please raise an issue.
+
+ ## Citation
+ If you find this work useful, please consider citing the following papers:
+
+ ```bibtex
+ @misc{tan2025decompilebench,
+       title={Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation},
+       author={Hanzhuo Tan and Xiaolong Tian and Hanrui Qi and Jiaming Liu and Zuchen Gao and Siyi Wang and Qi Luo and Jing Li and Yuqun Zhang},
+       year={2025},
+       eprint={2505.12668},
+       archivePrefix={arXiv},
+       primaryClass={cs.PL},
+       url={https://arxiv.org/abs/2505.12668},
+ }
+ ```
+
+ ```bibtex
+ @misc{tan2024llm4decompile,
+       title={LLM4Decompile: Decompiling Binary Code with Large Language Models},
+       author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
+       year={2024},
+       eprint={2403.05286},
+       archivePrefix={arXiv},
+       primaryClass={cs.PL}
+ }
+ ```