Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

HTML-Pruner-Phi-3.8B - bnb 4bits
- Model creator: https://huggingface.co/zstanjj/
- Original model: https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B/
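
A minimal loading sketch for a bitsandbytes 4-bit checkpoint such as this one, using 🤗 Transformers. The repository id below is a placeholder (substitute the id of this repository), and `bitsandbytes` plus `accelerate` are assumed to be installed:

```python
# Sketch: load the bnb-4bit quantized pruner with transformers.
# The repo id is a placeholder -- replace it with the id of this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RichardErkhov/HTML-Pruner-Phi-3.8B-4bits"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# The quantization config is stored with the checkpoint, so it loads in 4-bit
# automatically; device_map="auto" places the weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")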

Original model description:
---
language:
- en
library_name: transformers
base_model: microsoft/Phi-3.5-mini-instruct
license: apache-2.0
---

## Model Information

We release the HTML pruner model used in **HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems**.

<p align="left">
Useful links: 📝 <a href="https://arxiv.org/abs/2411.02959" target="_blank">Paper</a> • 🤗 <a href="https://huggingface.co/zstanjj/SlimPLM-Query-Rewriting/" target="_blank">Hugging Face</a> • 🧩 <a href="https://github.com/plageon/SlimPLM" target="_blank">Github</a>
</p>

We propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To tackle the long context brought by HTML, we propose **Lossless HTML Cleaning** and **Two-Step Block-Tree-Based HTML Pruning**.

- **Lossless HTML Cleaning**: This cleaning process only removes totally irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML. The compressed HTML produced by lossless cleaning is suitable for RAG systems that have long-context LLMs and are not willing to lose any information before generation.

- **Two-Step Block-Tree-Based HTML Pruning**: The block-tree-based HTML pruning consists of two steps, both of which are conducted on the block tree structure. The first pruning step uses an embedding model to calculate scores for blocks, while the second step uses a path generative model. The first step processes the result of lossless HTML cleaning, while the second step processes the result of the first pruning step. A condensed sketch of the full pipeline follows this list.
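
The following is a condensed, illustrative sketch of the two stages chained together with the `htmlrag` package; the step-by-step User Guide below documents each call in detail, and the input file, parameter values, and device here are assumptions rather than recommendations:

```python
# Condensed sketch of the HtmlRAG pipeline; see the User Guide below for details.
from htmlrag import clean_html, build_block_tree, EmbedHTMLPruner, GenHTMLPruner
from transformers import AutoTokenizer

question = "When was the bellagio in las vegas built?"
html = open("page.html").read()  # illustrative input file

# 1. Lossless cleaning: drop scripts/comments and compress redundant structure.
simplified_html = clean_html(html)

# 2a. First pruning step: rank coarse blocks with an embedding model.
#     (See the guide below for the query instruction and remote-endpoint options.)
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=256)
embed_pruner = EmbedHTMLPruner(embed_model="BAAI/bge-large-en", local_inference=True)
rankings = embed_pruner.calculate_block_rankings(question, simplified_html, block_tree)
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
pruned_html = embed_pruner.prune_HTML(simplified_html, block_tree, rankings, chat_tokenizer, 6144)

# 2b. Second pruning step: rank finer blocks with the path generative model (this model).
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=128)
gen_pruner = GenHTMLPruner(gen_model="zstanjj/HTML-Pruner-Phi-3.8B", device="cuda")
rankings = gen_pruner.calculate_block_rankings(question, pruned_html, block_tree)
pruned_html = gen_pruner.prune_HTML(pruned_html, block_tree, rankings, chat_tokenizer, 4096)
```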

🌹 If you use this model, please ✨star our **[GitHub repository](https://github.com/plageon/HTMLRAG)** to support us. Your star means a lot!

## 📦 Installation

Install the package using pip:
```bash
pip install htmlrag
```
Or install the package from source:
```bash
pip install -e .
```

---

## 📖 User Guide

### 🧹 HTML Cleaning

```python
from htmlrag import clean_html

question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<h1>Bellagio Hotel in Las</h1>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""

# Alternatively, you can read HTML files and merge them:
# html_files = ["/path/to/html/file1.html", "/path/to/html/file2.html"]
# htmls = [open(file).read() for file in html_files]
# html = "\n".join(htmls)

simplified_html = clean_html(html)
print(simplified_html)

# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
```

### 🔧 Configure Pruning Parameters

The example HTML document above is quite short; real-world HTML documents can be much longer and more complex. To handle such cases, we can configure the following parameters:
```python
# Maximum number of words in a node when constructing the block tree for pruning with the embedding model
MAX_NODE_WORDS_EMBED = 10
# MAX_NODE_WORDS_EMBED = 256  # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the embedding model
MAX_CONTEXT_WINDOW_EMBED = 60
# MAX_CONTEXT_WINDOW_EMBED = 6144  # a recommended setting for real-world HTML documents
# Maximum number of words in a node when constructing the block tree for pruning with the generative model
MAX_NODE_WORDS_GEN = 5
# MAX_NODE_WORDS_GEN = 128  # a recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document pruned with the generative model
MAX_CONTEXT_WINDOW_GEN = 32
# MAX_CONTEXT_WINDOW_GEN = 4096  # a recommended setting for real-world HTML documents
```

### 🌲 Build Block Tree

```python
from htmlrag import build_block_tree

block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
# block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED, zh_char=True)  # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path: ['html', 'div']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
```

### ✂️ Prune HTML Blocks with Embedding Model

```python
from htmlrag import EmbedHTMLPruner

embed_model = "BAAI/bge-large-en"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval=query_instruction_for_retrieval)
# Alternatively, you can init a remote TEI model; refer to https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint = "http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval=query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings = embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [2, 0, 1]

# Alternatively, you can use BM25 to rank the blocks:
from htmlrag import BM25HTMLPruner

bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [2, 0, 1]

from transformers import AutoTokenizer

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_EMBED)
print(pruned_html)

# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
```

### ✂️ Prune HTML Blocks with Generative Model

```python
from htmlrag import GenHTMLPruner
import torch

# Construct a finer-grained block tree
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
# block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True)  # for Chinese text
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content: <h1>Bellagio Hotel in Las</h1>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True

ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, device=device)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html, block_tree)
print(block_rankings)

# [1, 0]

pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_GEN)
print(pruned_html)

# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
```
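
The pruned HTML can then serve as the external knowledge passed to a chat model. A minimal sketch, assuming the Llama-3.1-70B-Instruct chat model used in the results below is available locally (the prompt wording is illustrative, and a model of this size needs multiple GPUs or a further quantized variant):

```python
# Sketch: answer the question with the pruned HTML as RAG context.
from transformers import AutoModelForCausalLM

chat_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct", device_map="auto"
)
messages = [{
    "role": "user",
    "content": f"Use the following HTML as context to answer the question.\n\n{pruned_html}\n\nQuestion: {question}",
}]
# chat_tokenizer was loaded above from the same checkpoint.
input_ids = chat_tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(chat_model.device)
output_ids = chat_model.generate(input_ids, max_new_tokens=128)
print(chat_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```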

---

## Results

- **Results for [HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B) with Llama-3.1-70B-Instruct as the chat model**.

| Dataset          | ASQA      | HotpotQA  | NQ        | TriviaQA  | MuSiQue   | ELI5      |
|------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Metrics          | EM        | EM        | EM        | EM        | EM        | ROUGE-L   |
| BM25             | 49.50     | 38.25     | 47.00     | 88.00     | 9.50      | 16.15     |
| BGE              | 68.00     | 41.75     | 59.50     | 93.00     | 12.50     | 16.20     |
| E5-Mistral       | 63.00     | 36.75     | 59.50     | 90.75     | 11.00     | 16.17     |
| LongLLMLingua    | 62.50     | 45.00     | 56.75     | 92.50     | 10.25     | 15.84     |
| JinaAI Reader    | 55.25     | 34.25     | 48.25     | 90.00     | 9.25      | 16.06     |
| HtmlRAG-Phi-3.8B | **68.50** | **46.25** | 60.50     | **93.50** | **13.25** | **16.33** |
| HtmlRAG-Llama-1B | 66.50     | 45.00     | **60.75** | 93.00     | 10.00     | 16.25     |

---

## 📜 Citation

```bibtex
@misc{tan2024htmlraghtmlbetterplain,
  title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems},
  author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
  year={2024},
  eprint={2411.02959},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2411.02959},
}
```