---
base_model: mistralai/Mistral-7B-v0.1
library_name: peft
license: mit
datasets:
- julioc-p/Question-Sparql
language:
- de
- en
metrics:
- f1
- precision
- recall
tags:
- code
- text-to-sparql
- sparql
- wikidata
- german
- qlora
---

This model is a fine-tuned version of `mistralai/Mistral-7B-v0.1` for generating SPARQL queries from German natural language questions, specifically targeting the Wikidata knowledge graph.

## Model Details

### Model Description

The model was fine-tuned using QLoRA with 4-bit quantization. It takes a German natural language question and the corresponding entity/relationship context as input and aims to produce a SPARQL query for Wikidata. It is part of experiments investigating continual multilingual pre-training.

- **Developed by:** Julio Cesar Perez Duran
- **Funded by:** DFKI
- **Model type:** Decoder-only Transformer-based language model
- **Language:** de (German)
- **License:** MIT
- **Finetuned from model:** `mistralai/Mistral-7B-v0.1`

## Bias, Risks, and Limitations

- **Context reliance:** Performance depends heavily on the accuracy and completeness of the provided entity/relationship context mappings.
- **Output format:** V2 models sometimes generated extraneous text after the SPARQL query, requiring post-processing (extraction of the content within ` ```sparql ... ``` ` delimiters).
- **EOS token generation:** Inconsistent end-of-sequence token generation was observed, possibly influenced by dataset packing during training.

## How to Get Started with the Model

The following Python script shows how to load the model and tokenizer and generate a SPARQL query.

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM  # Use AutoPeftModelForCausalLM for v2 models
import re
import json

# Model ID for the Mistral German v2 model
model_id = "julioc-p/mistral_txt_sparql_de_v2"

# Configuration for 4-bit quantization (as used in the v2 setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load the model and tokenizer
print(f"Loading model: {model_id}")
model = AutoPeftModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"Loading tokenizer for: {model_id}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Strict pattern for extracting a complete SPARQL query, including trailing solution modifiers
sparql_pattern_strict = re.compile(
    r"""
    (SELECT|ASK|CONSTRUCT|DESCRIBE)   # Match the starting keyword
    .*?                               # Match any characters non-greedily
    \}                                # Match the closing brace of the main block
    (                                 # Start of optional block for trailing clauses
        (?:                           # Non-capturing group for one or more trailing clauses
            \s*
            (?:                       # Clause alternatives
                (?:(?:GROUP|ORDER)\s+BY|HAVING)\s+.+?\s*(?=\s*(?:(?:GROUP|ORDER)\s+BY|HAVING|LIMIT|OFFSET|VALUES|$))
                | LIMIT\s+\d+
                | OFFSET\s+\d+
                | VALUES\s*(?:\{.*?\}|\w+|\(.*?\))
            )
        )*                            # Match zero or more clauses
    )
    """,
    re.DOTALL | re.IGNORECASE | re.VERBOSE,
)


def extract_sparql(text):
    """Extract a SPARQL query from model output, preferring a fenced ```sparql ... ``` block."""
    code_block_match = re.search(r"```(?:sparql)?\s*(.*?)\s*```", text, re.DOTALL | re.IGNORECASE)
    if code_block_match:
        text_to_search = code_block_match.group(1)
    else:
        text_to_search = text  # Search the raw text if there is no markdown code block

    match = sparql_pattern_strict.search(text_to_search)
    if match:
        return match.group(0).strip()

    # Fallback: grab everything from the first SPARQL keyword up to a closing brace
    fallback_match = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text_to_search, re.DOTALL | re.IGNORECASE
    )
    if fallback_match:
        return fallback_match.group(0).strip()
    return ""


# --- Example usage ---
question = "Wer war der amerikanische weibliche Angestellte des Barnard College?"
example_context_json_str = '''
{
  "entitäten": {
    "Barnard College": "Q167733",
    "amerikanisch": "Q30",
    "weiblich": "Q6581072",
    "Angestellte": "Q5"
  },
  "beziehungen": {
    "Instanz von": "P31",
    "Arbeitgeber": "P108",
    "Geschlecht": "P21",
    "Land der Staatsbürgerschaft": "P27"
  }
}
'''

# System prompt template for v2 models (German)
system_message_template = """Sie sind ein Experte für die Übersetzung von Text in SPARQL-Anfragen. Benutzer werden Ihnen Fragen auf Deutsch stellen, und Sie werden eine SPARQL-Anfrage basierend auf dem bereitgestellten Kontext generieren, der in ```sparql ``` eingeschlossen ist.
KONTEXT:
{context}"""

# Format the system message with the actual context
formatted_system_message = system_message_template.format(context=example_context_json_str)

chat_template = [
    {"role": "system", "content": formatted_system_message},
    {"role": "user", "content": question},
]

inputs = tokenizer.apply_chat_template(
    chat_template,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # needed so inputs.input_ids / inputs.attention_mask are available below
    return_tensors="pt",
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=512,   # settings used in the v2 generation script
        do_sample=True,
        temperature=0.7,      # common sampling defaults
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode the full output
generated_text_full = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract only the assistant's response (ChatML-specific extraction)
assistant_response_part = ""
if "<|im_start|>assistant" in generated_text_full:  # Specific to ChatML after template application
    assistant_response_part = (
        generated_text_full.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()
    )
elif "assistant\n" in generated_text_full:  # More generic, in case the template output varies
    assistant_response_part = generated_text_full.split("assistant\n")[-1].strip()
else:
    # Fall back to decoding only the newly generated tokens
    input_length = inputs.input_ids.shape[1]
    assistant_response_part = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()

cleaned_sparql = extract_sparql(assistant_response_part)

print(f"Frage: {question}")
print(f"Kontext: {example_context_json_str}")
print(f"Generierte SPARQL: {cleaned_sparql}")
print(f"Textausgabe (Assistent): {assistant_response_part}")
```

## Training Details

### Training Data

The model was fine-tuned on a subset of the `julioc-p/Question-Sparql` dataset consisting of 80,000 German examples.
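For orientation, the snippet below sketches how examples from this dataset could be arranged into the same system/user chat layout that the usage example above expects at inference time. This is an illustrative sketch only: the split name and the column names (`text_query`, `sparql_query`, `context`, `language`) are assumptions and may not match the actual dataset schema.

```python
from datasets import load_dataset

# Split and column names below ("text_query", "sparql_query", "context", "language")
# are assumptions for illustration; check the dataset card for the real schema.
ds = load_dataset("julioc-p/Question-Sparql", split="train")

SYSTEM_TEMPLATE = (
    "Sie sind ein Experte für die Übersetzung von Text in SPARQL-Anfragen. "
    "Benutzer werden Ihnen Fragen auf Deutsch stellen, und Sie werden eine SPARQL-Anfrage "
    "basierend auf dem bereitgestellten Kontext generieren, der in ```sparql ``` eingeschlossen ist.\n"
    "KONTEXT:\n{context}"
)

def to_chat_example(row):
    """Arrange one row into the system/user/assistant message layout used at inference time."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_TEMPLATE.format(context=row["context"])},
            {"role": "user", "content": row["text_query"]},
            {"role": "assistant", "content": f"```sparql\n{row['sparql_query']}\n```"},
        ]
    }

# Keep only German examples and map them to the chat layout
german_ds = ds.filter(lambda row: row["language"] == "de")
chat_ds = german_ds.map(to_chat_example, remove_columns=german_ds.column_names)
print(chat_ds[0]["messages"][1]["content"])
```

Wrapping the gold query in ` ```sparql ... ``` ` fences in the assistant turn mirrors the output format described above, which the `extract_sparql` helper in the usage example relies on.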
#### Training Hyperparameters

The following hyperparameters were used (a configuration sketch appears at the end of this card):

- **LoRA configuration:**
  - `r` (LoRA rank): 256
  - `lora_alpha`: 128
  - `lora_dropout`: 0.05
  - `target_modules`: "all-linear"
  - `task_type`: "CAUSAL_LM"
- **Training arguments:**
  - `num_train_epochs`: 3
  - Effective batch size: 6 (`per_device_train_batch_size=1`, `gradient_accumulation_steps=6`)
  - `optim`: "adamw_torch_fused"
  - `learning_rate`: 2e-4
  - `weight_decay`: 0.05
  - `fp16`: True
  - `max_grad_norm`: 0.3
  - `warmup_ratio`: 0.03
  - `lr_scheduler_type`: "constant"
  - `packing`: True
  - NEFTune `noise_alpha`: 5
- **BitsAndBytesConfig:**
  - `load_in_4bit`: True
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: `torch.float16`
  - `bnb_4bit_use_double_quant`: True

#### Speeds, Sizes, Times

- Fine-tuning of Mistral German v2 took approximately 19-20 hours on a single NVIDIA V100 GPU.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

1. **QALD-10 test set (German):** a standardized benchmark; 394 German questions were evaluated for this model.
2. **v2 test set (German):** 10,000 held-out German examples from the `julioc-p/Question-Sparql` dataset, including context.

#### Metrics

Macro-averaged F1-score, precision, and recall following the QALD standard. Non-executable queries are assigned precision, recall, and F1 of 0. The percentage of executable queries is also tracked.

### Results

**On QALD-10 (German, N=394):**

- **Macro F1-score:** 0.2595
- **Macro precision:** 0.6419
- **Macro recall:** 0.2632
- **Executable queries:** 99.75% (393/394)
- **Correctness (exact match + both empty):** 25.13% (99/394)
  - Correct (exact match): 23.86% (94/394)
  - Correct (both empty): 1.27% (5/394)

**On the v2 test set (German, N=10,000):**

- **Macro F1-score:** 0.7183
- **Macro precision:** 0.8362
- **Macro recall:** 0.7198
- **Executable queries:** 97.27% (9,727/10,000)
- **Correctness (exact match + both empty):** 71.58% (7,158/10,000)
  - Correct (exact match): 62.74% (6,274/10,000)
  - Correct (both empty): 8.84% (884/10,000)

## Environmental Impact

- **Hardware Type:** 1 x NVIDIA V100 32 GB GPU
- **Hours used:** approx. 19-20 hours for fine-tuning
- **Cloud Provider:** DFKI HPC Cluster
- **Compute Region:** Germany
- **Carbon Emitted:** approx. 2.96 kg CO2eq

## Technical Specifications

### Compute Infrastructure

#### Hardware

- NVIDIA V100 GPU (32 GB VRAM)
- Approx. 60 GB system RAM

#### Software

- Slurm, NVIDIA Enroot, CUDA 11.8.0
- Python, Hugging Face `transformers`, `peft` (0.13.2), `bitsandbytes`, `trl`, PyTorch

## More Information

- **Thesis GitHub:** [https://github.com/julioc-p/cross-lingual-transferability-thesis](https://github.com/julioc-p/cross-lingual-transferability-thesis)
- **Dataset:** [https://huggingface.co/datasets/julioc-p/Question-Sparql](https://huggingface.co/datasets/julioc-p/Question-Sparql)

### Framework versions

- PEFT 0.13.2
- Transformers 4.39.3
- bitsandbytes 0.43.0
- TRL 0.8.6
- PyTorch 2.1.0
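For readers who want to map the values under "Training Hyperparameters" onto the frameworks listed above, the sketch below shows one way they could be assembled with `peft`, `transformers`, and `trl`. It is an illustrative reconstruction rather than the exact training script; `output_dir`, `base_model`, `tokenizer`, `chat_ds`, and the `formatting_func` are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization config (mirrors the BitsAndBytesConfig values listed above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# LoRA adapter config (mirrors the LoRA values listed above)
lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Training arguments (mirrors the training values listed above; output_dir is a placeholder)
training_args = TrainingArguments(
    output_dir="mistral_txt_sparql_de_v2",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    optim="adamw_torch_fused",
    learning_rate=2e-4,
    weight_decay=0.05,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    neftune_noise_alpha=5,
)

# Trainer wiring, left commented because base_model, tokenizer, and chat_ds are placeholders:
# from trl import SFTTrainer
# trainer = SFTTrainer(
#     model=base_model,            # mistralai/Mistral-7B-v0.1 loaded with bnb_config
#     args=training_args,
#     train_dataset=chat_ds,       # formatted training split (see the data sketch above)
#     peft_config=lora_config,
#     tokenizer=tokenizer,
#     packing=True,                # dataset packing, as noted above
#     formatting_func=lambda ex: tokenizer.apply_chat_template(ex["messages"], tokenize=False),
# )
# trainer.train()
```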