---
|
|
base_model: occiglot/occiglot-7b-eu5-instruct |
|
|
library_name: peft |
|
|
license: mit |
|
|
datasets: |
|
|
- julioc-p/Question-Sparql |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- f1 |
|
|
- precision |
|
|
- recall |
|
|
tags: |
|
|
- code |
|
|
- text-to-sparql |
|
|
- sparql |
|
|
- wikidata |
|
|
--- |
|
|
|
|
|
This model is a fine-tuned version of `occiglot/occiglot-7b-eu5-instruct` for generating SPARQL queries from English natural language questions, specifically targeting the Wikidata knowledge graph. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model was fine-tuned with QLoRA, using 4-bit quantization. It takes an English natural language question as input and aims to produce a corresponding SPARQL query that can be executed against the Wikidata knowledge graph. It is part of a series of experiments investigating the impact of continual multilingual pre-training on cross-lingual transferability and task-specific performance.
|
|
|
|
|
- **Developed by:** Julio Cesar Perez Duran
- **Funded by:** DFKI
- **Model type:** Decoder-only Transformer-based language model
- **Language(s) (NLP):** en (English)
- **License:** MIT
- **Finetuned from model:** `occiglot/occiglot-7b-eu5-instruct`
|
|
|
|
|
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Entity/Relationship Linking Bottleneck:** A primary limitation of this model is a significant deficiency in accurately mapping textual entities and relationships to their correct Wikidata identifiers (QIDs and PIDs). While the model might generate structurally valid SPARQL, the entities or properties could be incorrect (e.g., mapping "Bill Gates" to Microsoft's QID instead of Bill Gates'). This significantly impacted recall. |
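As a hypothetical illustration of this bottleneck (the question, queries, and QIDs below are illustrative, not actual model outputs), the sketch below runs a correctly linked query for "Where was Bill Gates born?" (`wd:Q5284`, Bill Gates) and a mis-linked variant (`wd:Q2283`, Microsoft) against the public Wikidata endpoint. Both queries are executable, but the mis-linked one returns no results, which is exactly the pattern that depresses recall:

```python
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# "Where was Bill Gates born?" -- correctly linked to Bill Gates (Q5284).
correct_query = """
SELECT ?birthplaceLabel WHERE {
  wd:Q5284 wdt:P19 ?birthplace .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# Same structure, but mis-linked to Microsoft (Q2283): still executable,
# yet it returns no rows, since companies have no place of birth (P19).
mislinked_query = correct_query.replace("Q5284", "Q2283")

for name, query in [("correct", correct_query), ("mis-linked", mislinked_query)]:
    response = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "model-card-example/0.1"},
    )
    bindings = response.json()["results"]["bindings"]
    print(f"{name}: {len(bindings)} result(s)", bindings[:1])
```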
|
|
## How to Get Started with the Model |
|
|
|
|
|
The following Python script provides an example of how to load the model and tokenizer using the Hugging Face Transformers and PEFT libraries to generate a SPARQL query. |
|
|
|
|
|
```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "julioc-p/occiglot-7b-eu5-instruct-txt-sparql-4bit"
base_model_for_tokenizer = "occiglot/occiglot-7b-eu5-instruct"  # tokenizer comes from the base model

# Configuration for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support (e.g., V100)
    bnb_4bit_use_double_quant=False,
)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # automatically uses a GPU if available
)
tokenizer = AutoTokenizer.from_pretrained(base_model_for_tokenizer)
tokenizer.pad_token = tokenizer.eos_token


def extract_sparql(text):
    """Extract the first complete SPARQL query (up to its closing brace) from generated text."""
    match = re.search(
        r"(SELECT|ASK|CONSTRUCT|DESCRIBE).*?\}", text, re.DOTALL | re.IGNORECASE
    )
    if match:
        return match.group(0).strip()
    return ""


# --- Example usage ---
question = "What is the boiling point of water?"

system_prompt_content = "You are an expert in SPARQL query generation. Generate the SPARQL query that answers the user's question."

messages = [
    {"role": "system", "content": system_prompt_content},
    {"role": "user", "content": question},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # adds the assistant's turn start
    return_tensors="pt",
).to(model.device)

# Generate the output
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)

# Decode only the generated part (the assistant's response)
generated_text_assistant_part = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

cleaned_sparql = extract_sparql(generated_text_assistant_part)

print(f"Question: {question}")
print(f"Generated SPARQL: {cleaned_sparql}")
print(f"Raw generated text (assistant part): {generated_text_assistant_part}")
```
|
|
|
|
|
|
|
|
## Training Details

### Training Data
|
|
|
|
|
The model was fine-tuned on a subset of the `julioc-p/Question-Sparql` dataset. Specifically, a 35,000-sample English subset filtered to include only Wikidata-related queries was used. |
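A subset along these lines could be reproduced with the `datasets` library, as sketched below. The split name and the `language`/`knowledge_graphs` column names are assumptions about the dataset schema; check the dataset card before relying on them.

```python
from datasets import load_dataset

# Load the question/SPARQL pairs ("train" split assumed).
ds = load_dataset("julioc-p/Question-Sparql", split="train")

# Hypothetical filter: keep English questions targeting Wikidata.
# NOTE: "language" and "knowledge_graphs" are assumed column names --
# verify them against the dataset card before use.
subset = ds.filter(
    lambda ex: ex["language"] == "en" and ex["knowledge_graphs"] == "Wikidata"
)

# Sample 35,000 examples, as described above.
subset = subset.shuffle(seed=42).select(range(min(35_000, len(subset))))
print(subset)
```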
|
|
|
|
|
### Training Procedure

#### Training Hyperparameters
|
|
|
|
|
The following hyperparameters were used for the v1.1 Occiglot English fine-tuning (a configuration sketch in code follows the list):
|
|
- **Base Model:** `occiglot/occiglot-7b-eu5-instruct` |
|
|
- **LoRA Configuration (for Occiglot v1.1):** |
|
|
- `r` (LoRA rank): 64 |
|
|
- `lora_alpha`: 16 |
|
|
- `lora_dropout`: 0.1 |
|
|
- `bias`: "none" |
|
|
- `task_type`: "CAUSAL_LM" |
|
|
- `target_modules`: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head" |
|
|
- **Training Arguments:** |
|
|
- `num_train_epochs`: 5 |
|
|
- `per_device_train_batch_size`: 1 |
|
|
- `gradient_accumulation_steps`: 8 (Effective batch size of 8) |
|
|
- `gradient_checkpointing`: True |
|
|
- `optim`: "paged_adamw_32bit" |
|
|
- `learning_rate`: 1e-5 |
|
|
- `weight_decay`: 0.05 |
|
|
- `bf16`: False (NVIDIA V100 does not support bfloat16) |
|
|
- `fp16`: True |
|
|
- `max_grad_norm`: 1.0 |
|
|
- `warmup_ratio`: 0.01 |
|
|
- `lr_scheduler_type`: "cosine" |
|
|
- `group_by_length`: True |
|
|
- `packing`: False (for v1 models) |
|
|
- **BitsAndBytesConfig:** |
|
|
- `load_in_4bit`: True |
|
|
- `bnb_4bit_quant_type`: "nf4" |
|
|
- `bnb_4bit_compute_dtype`: `torch.float16` |
|
|
- `bnb_4bit_use_double_quant`: False |
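The settings above translate roughly into the following `peft`/`transformers` configuration objects. This is a reconstruction from the list, not the original training script: `output_dir` is a placeholder, and the run presumably used `trl` (listed under Software) with `packing=False` as noted above.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization, as listed above (fp16 compute on V100).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# LoRA adapter configuration matching the values above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
)

# Training arguments matching the values above.
training_args = TrainingArguments(
    output_dir="occiglot-7b-txt-sparql",  # placeholder path
    num_train_epochs=5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=1e-5,
    weight_decay=0.05,
    bf16=False,  # V100 does not support bfloat16
    fp16=True,
    max_grad_norm=1.0,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    group_by_length=True,
)
```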
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
- The training took approximately 19-20 hours for 5 epochs on a single NVIDIA V100 GPU. |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
1. **QALD-10 test set:** 391 English questions over Wikidata from the QALD-10 benchmark.
2. **v1 Test Set (English):** 3,500 held-out English examples randomly sampled from the `julioc-p/Question-Sparql` dataset.

The system evaluated was `occiglot/occiglot-7b-eu5-instruct` fine-tuned with QLoRA (v1.1).
|
|
|
|
|
#### Metrics |
|
|
|
|
|
The primary evaluation metrics were the QALD-standard macro-averaged F1-score, precision, and recall. Non-executable queries were scored with precision, recall, and F1 of 0.

The percentage of **Executable Queries** was also tracked. Correctness was further broken down into "Correct (Exact Match)" and "Correct (Both Empty)".
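The sketch below shows one plausible per-question scoring function consistent with this description: gold and predicted answers are compared as sets, two empty answer sets count as correct, and non-executable predictions score zero. The exact QALD convention for empty predictions varies between implementations, so treat this as an assumption rather than the evaluation code actually used.

```python
def question_scores(gold, pred, executable=True):
    """Per-question precision/recall/F1 over answer sets (QALD-style sketch)."""
    if not executable:
        return 0.0, 0.0, 0.0  # non-executable queries score zero
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0, 1.0, 1.0  # "Correct (Both Empty)"
    if not gold or not pred:
        return 0.0, 0.0, 0.0
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(scores):
    """Average per-question (P, R, F1) tuples over the test set."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy example: one exact match, one wrong entity, one non-executable query.
print(macro_average([
    question_scores({"Q42"}, {"Q42"}),
    question_scores({"Q42"}, {"Q2283"}),
    question_scores({"Q42"}, set(), executable=False),
]))  # -> (0.333..., 0.333..., 0.333...)
```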
|
|
|
|
|
### Results |
|
|
|
|
|
**On QALD-10 (English, N=391):** |
|
|
- **Macro F1-Score:** 0.0439 |
|
|
- **Macro Precision:** 0.7698 |
|
|
- **Macro Recall:** 0.0437 |
|
|
- **Executable Queries:** 95.40% (373/391) |
|
|
- **Correctness (Exact Match + Both Empty):** 4.35% (17/391) |
|
|
- Correct (Exact Match): 3.07% (12/391) |
|
|
- Correct (Both Empty): 1.28% (5/391) |
|
|
|
|
|
**On the v1 Test Set (English, N=3,500):**
|
|
- **Macro F1-Score:** 0.2998 |
|
|
- **Macro Precision:** 0.9884 |
|
|
- **Macro Recall:** 0.3013 |
|
|
- **Executable Queries:** 99.57% (3485/3500) |
|
|
- **Correctness (Exact Match + Both Empty):** 29.80% (1043/3500) |
|
|
- Correct (Exact Match): 20.83% (729/3500) |
|
|
- Correct (Both Empty): 8.97% (314/3500) |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
|
|
- **Hardware Type:** 1 x NVIDIA V100 32GB GPU |
|
|
- **Hours used:** Approx. 19-20 hours for the v1.1 English model fine-tuning. |
|
|
- **Cloud Provider:** DFKI HPC Cluster |
|
|
- **Compute Region:** Germany |
|
|
- **Carbon Emitted:** Approx. 2.96 kg CO2eq. |
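For reference, the reported figure is consistent with the calculator's methodology under assumed inputs of a ~300 W average draw for the V100 and a German grid carbon intensity of roughly 0.49 kg CO2eq/kWh: 0.3 kW × 20 h × 0.49 kg CO2eq/kWh ≈ 2.96 kg CO2eq. Both the power draw and the grid intensity are illustrative assumptions, not measured values.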
|
|
|
|
|
## Technical Specifications
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- 1× NVIDIA V100 GPU (32 GB VRAM)
|
|
- 60 GB system RAM |
|
|
|
|
|
#### Software |
|
|
|
|
|
- Slurm, NVIDIA Enroot, CUDA 11.8.0 |
|
|
- Python, Hugging Face `transformers`, `peft` (0.13.2), `bitsandbytes`, `trl`, PyTorch. |
|
|
|
|
|
## More Information
|
|
|
|
|
- **Thesis GitHub:** [https://github.com/julioc-p/cross-lingual-transferability-thesis](https://github.com/julioc-p/cross-lingual-transferability-thesis) |
|
|
- **Dataset:** [https://huggingface.co/datasets/julioc-p/Question-Sparql](https://huggingface.co/datasets/julioc-p/Question-Sparql) |
|
|
|
|
|
### Framework versions |
|
|
- PEFT 0.13.2
- Transformers 4.39.3
- bitsandbytes 0.43.0
- TRL 0.8.6
- PyTorch 2.1.0
|
|
|