---
base_model: mistralai/Mistral-7B-v0.1
library_name: peft
license: mit
datasets:
- julioc-p/Question-Sparql
language:
- de
- en
metrics:
- f1
- precision
- recall
tags:
- code
- text-to-sparql
- sparql
- wikidata
- german
- qlora
---

This model is a fine-tuned version of `mistralai/Mistral-7B-v0.1` for generating SPARQL queries from German natural-language questions, specifically targeting the Wikidata knowledge graph.

## Model Details

### Model Description

The model was fine-tuned with QLoRA using 4-bit quantization. It takes a German natural-language question together with entity/relationship context as input and produces a SPARQL query for Wikidata. It is part of a set of experiments investigating continual multilingual pre-training.

- **Developed by:** Julio Cesar Perez Duran
- **Funded by:** DFKI
- **Model type:** Decoder-only Transformer-based language model
- **Language:** de (German)
- **License:** MIT
- **Finetuned from model:** `mistralai/Mistral-7B-v0.1`
 
## Bias, Risks, and Limitations

- **Context reliance:** Performance depends heavily on the accuracy and completeness of the provided entity/relationship context mappings.
- **Output format:** V2 models sometimes generated extraneous text after the SPARQL query, so post-processing is required (extraction of the content within the ` ```sparql ... ``` ` delimiters; see the `extract_sparql` helper in the example below).
- **EOS token generation:** Inconsistent end-of-sequence token generation was observed, possibly influenced by dataset packing during training.
 
## How to Get Started with the Model

The following Python script shows how to load the model and tokenizer and generate a SPARQL query from a German question and its entity/relationship context.

```python
import re

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM  # loads the adapter together with its base model

# Model ID of the Mistral German v2 model
model_id = "julioc-p/mistral_txt_sparql_de_v2"

# 4-bit quantization configuration (matching the v2 setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the model and tokenizer
print(f"Loading model: {model_id}")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"Loading tokenizer for: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Strict pattern: a query keyword, a braced body, then optional trailing clauses
sparql_pattern_strict = re.compile(
    r"""
    (SELECT|ASK|CONSTRUCT|DESCRIBE)   # match the starting keyword
    .*?                               # match any characters non-greedily
    \}                                # match the closing brace of the main block
    (                                 # optional trailing clauses
        (?:
            \s*
            (?:
                (?:(?:GROUP|ORDER)\s+BY|HAVING)\s+.+?\s*(?=\s*(?:(?:GROUP|ORDER)\s+BY|HAVING|LIMIT|OFFSET|VALUES|$)) |
                LIMIT\s+\d+ |
                OFFSET\s+\d+ |
                VALUES\s*(?:\{.*?\}|\w+|\(.*?\))
            )
        )*                            # zero or more trailing clauses
    )
    """,
    re.DOTALL | re.IGNORECASE | re.VERBOSE,
)


def extract_sparql(text):
    """Extract a SPARQL query from model output, preferring fenced code blocks."""
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    text_to_search = code_block_match.group(1) if code_block_match else text

    match = sparql_pattern_strict.search(text_to_search)
    if match:
        return match.group(0).strip()
    # Fallback: anything from a query keyword up to a closing brace
    fallback_match = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text_to_search, re.DOTALL | re.IGNORECASE
    )
    return fallback_match.group(0).strip() if fallback_match else ""


# --- Example usage ---
question = "Wer war der amerikanische weibliche Angestellte des Barnard College?"
example_context_json_str = """
{
  "entitäten": {
    "Barnard College": "Q167733",
    "amerikanisch": "Q30",
    "weiblich": "Q6581072",
    "Angestellte": "Q5"
  },
  "beziehungen": {
    "Instanz von": "P31",
    "Arbeitgeber": "P108",
    "Geschlecht": "P21",
    "Land der Staatsbürgerschaft": "P27"
  }
}
"""

# System prompt template used for the v2 models (German)
system_message_template = """Sie sind ein Experte für die Übersetzung von Text in SPARQL-Anfragen. Benutzer werden Ihnen Fragen auf Deutsch stellen, und Sie werden eine SPARQL-Anfrage basierend auf dem bereitgestellten Kontext generieren, der in ```sparql <Antwortanfrage>``` eingeschlossen ist.
KONTEXT:
{context}"""

messages = [
    {"role": "system", "content": system_message_template.format(context=example_context_json_str)},
    {"role": "user", "content": question},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return attention_mask alongside input_ids
    return_tensors="pt",
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode the full output, keeping special tokens so the ChatML markers survive
generated_text_full = tokenizer.decode(outputs[0], skip_special_tokens=False)

# Extract only the assistant's response
if "<|im_start|>assistant" in generated_text_full:  # ChatML-formatted output
    assistant_response_part = (
        generated_text_full.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()
    )
else:  # fall back to decoding only the newly generated tokens
    input_length = inputs["input_ids"].shape[1]
    assistant_response_part = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()

cleaned_sparql = extract_sparql(assistant_response_part)

print(f"Frage: {question}")
print(f"Kontext: {example_context_json_str}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Textausgabe (Assistent): {assistant_response_part}")
```
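
Once a query has been extracted, it can be sanity-checked against the public Wikidata SPARQL endpoint. The snippet below is a minimal sketch using `requests`; it is not part of the original scripts, the `User-Agent` string is hypothetical, and Wikimedia asks clients to identify themselves and respect rate limits.

```python
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def run_sparql(query: str, timeout: int = 60) -> dict:
    """Send a SPARQL query to the public Wikidata endpoint and return JSON results."""
    response = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "text-to-sparql-demo/0.1 (example script)"},  # hypothetical UA
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()

if cleaned_sparql:  # `cleaned_sparql` comes from the script above
    results = run_sparql(cleaned_sparql)
    if "boolean" in results:  # ASK queries return a boolean instead of bindings
        print(results["boolean"])
    for binding in results.get("results", {}).get("bindings", []):
        print(binding)
```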
 
### Training Data

The model was fine-tuned on a subset of the `julioc-p/Question-Sparql` dataset consisting of 80,000 German examples. The sketch below shows how to load and inspect it.
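
The dataset can be loaded from the Hub with the `datasets` library. The filter below is a sketch: the `language` column name and its `de` value are assumptions about the dataset schema, so check the dataset card (or `dataset.column_names`) before relying on them.

```python
from datasets import load_dataset

# Load the Question-Sparql dataset from the Hugging Face Hub
dataset = load_dataset("julioc-p/Question-Sparql", split="train")
print(dataset.column_names)

# Keep only German examples; "language" == "de" is an assumed schema
german_examples = dataset.filter(lambda ex: ex["language"] == "de")
print(len(german_examples), german_examples[0])
```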
 
#### Training Hyperparameters

The following hyperparameters were used (a configuration sketch in `peft`/`trl` terms follows the list):

- **LoRA configuration:**
  - `r` (LoRA rank): 256
  - `lora_alpha`: 128
  - `lora_dropout`: 0.05
  - `target_modules`: "all-linear"
  - `task_type`: "CAUSAL_LM"
- **Training arguments:**
  - `num_train_epochs`: 3
  - Effective batch size: 6 (`per_device_train_batch_size=1`, `gradient_accumulation_steps=6`)
  - `optim`: "adamw_torch_fused"
  - `learning_rate`: 2e-4
  - `weight_decay`: 0.05
  - `fp16`: True
  - `max_grad_norm`: 0.3
  - `warmup_ratio`: 0.03
  - `lr_scheduler_type`: "constant"
  - `packing`: True
  - NEFTune `noise_alpha`: 5
- **BitsAndBytesConfig:**
  - `load_in_4bit`: True
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: `torch.float16`
  - `bnb_4bit_use_double_quant`: True
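
For reference, the sketch below shows how these hyperparameters map onto `peft`/`trl` objects for the library versions listed under Framework versions. It illustrates the configuration rather than reproducing the original training script: the output directory, the dataset column names in `formatting_func`, and the prompt formatting are all assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="mistral_txt_sparql_de_v2",  # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,  # effective batch size 6
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    weight_decay=0.05,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    neftune_noise_alpha=5,  # NEFTune noise
)

train_dataset = load_dataset("julioc-p/Question-Sparql", split="train")

def formatting_func(example):
    # Hypothetical: the real script builds a ChatML conversation from the question,
    # its context, and the target query; these column names are assumptions.
    return f"{example['text_query']}\n```sparql\n{example['sparql_query']}\n```"

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    formatting_func=formatting_func,
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=True,  # dataset packing, as noted under limitations
)
trainer.train()
```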

#### Speeds, Sizes, Times

- Training the Mistral German v2 model took approx. 19-20 hours on a single NVIDIA V100 GPU.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

1. **QALD-10 test set (German):** a standardized benchmark; 394 German questions were evaluated for this model.
2. **v2 test set (German):** 10,000 held-out German examples from the `julioc-p/Question-Sparql` dataset, including context.
 
#### Metrics

Evaluation uses the QALD-standard macro-averaged F1-score, precision, and recall. Non-executable queries receive precision, recall, and F1 of 0, and the percentage of executable queries is tracked separately. A sketch of the per-question computation is shown below.
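
The following is a minimal sketch of these conventions, assuming each query's answers are materialized as Python sets; it mirrors the description above (including the "both empty counts as correct" convention visible in the results), not the exact evaluation script.

```python
def prf1(gold, pred):
    """Per-question precision/recall/F1; `pred` is None for non-executable queries."""
    if pred is None:  # query did not execute
        return 0.0, 0.0, 0.0
    if not gold and not pred:  # both result sets empty: counted as correct
        return 1.0, 1.0, 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(scores):
    """Average per-question (precision, recall, f1) tuples."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

# Example: three questions, the third query failed to execute
scores = [
    prf1({"Q1", "Q2"}, {"Q1"}),  # partially correct
    prf1(set(), set()),          # both empty -> correct
    prf1({"Q5"}, None),          # non-executable -> all zeros
]
print(macro_average(scores))  # approximately (0.667, 0.5, 0.556)
```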
 
### Results

**On QALD-10 (German, N=394):**
- **Macro F1-score:** 0.2595
- **Macro precision:** 0.6419
- **Macro recall:** 0.2632
- **Executable queries:** 99.75% (393/394)
- **Correctness (exact match + both empty):** 25.13% (99/394)
  - Correct (exact match): 23.86% (94/394)
  - Correct (both empty): 1.27% (5/394)

**On the v2 test set (German, N=10,000):**
- **Macro F1-score:** 0.7183
- **Macro precision:** 0.8362
- **Macro recall:** 0.7198
- **Executable queries:** 97.27% (9,727/10,000)
- **Correctness (exact match + both empty):** 71.58% (7,158/10,000)
  - Correct (exact match): 62.74% (6,274/10,000)
  - Correct (both empty): 8.84% (884/10,000)
 
## Environmental Impact

- **Hardware type:** 1x NVIDIA V100 (32 GB) GPU
- **Hours used:** approx. 19-20 hours for fine-tuning
- **Compute provider:** DFKI HPC cluster
- **Compute region:** Germany
- **Carbon emitted:** approx. 2.96 kg CO2eq
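
The emission figure is consistent with a simple energy-times-grid-intensity estimate; the values in the sketch below are illustrative assumptions, not measurements.

```python
hours = 19.5          # midpoint of the reported 19-20 h run
gpu_power_kw = 0.30   # assumed average V100 board power
kg_co2_per_kwh = 0.5  # assumed German grid carbon intensity
print(hours * gpu_power_kw * kg_co2_per_kwh)  # ~2.9 kg CO2eq
```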
 
## Technical Specifications

### Compute Infrastructure

#### Hardware

- NVIDIA V100 GPU (32 GB VRAM)
- Approx. 60 GB system RAM

#### Software

- Slurm, NVIDIA Enroot, CUDA 11.8.0
- Python, Hugging Face `transformers`, `peft` (0.13.2), `bitsandbytes`, `trl`, PyTorch

## More Information

- **Thesis GitHub:** [https://github.com/julioc-p/cross-lingual-transferability-thesis](https://github.com/julioc-p/cross-lingual-transferability-thesis)
- **Dataset:** [https://huggingface.co/datasets/julioc-p/Question-Sparql](https://huggingface.co/datasets/julioc-p/Question-Sparql)

### Framework versions

- `peft` 0.13.2
- `transformers` 4.39.3
- `bitsandbytes` 0.43.0
- `trl` 0.8.6
- `torch` 2.1.0
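
For reproducibility, an environment matching these versions can be created with `pip install peft==0.13.2 transformers==4.39.3 bitsandbytes==0.43.0 trl==0.8.6 torch==2.1.0`; the exact PyTorch wheel depends on your CUDA toolkit (CUDA 11.8.0 was used here).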