---
license: cc-by-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally identifiable information
- pii
- ner
- azerbaijan
datasets:
- LocalDoc/pii_ner_azerbaijani
---
# PII NER Azerbaijani v2
**PII NER Azerbaijani v2** is the second version of a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa (first version: PII NER Azerbaijani).
It is trained on Azerbaijani PII data to detect and classify personally identifiable information, such as names, dates of birth, cities, addresses, and phone numbers, in text.
## Model Details
- **Base Model:** XLM-RoBERTa
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|----------------|------------------|-----------|---------|----------|
| 1 | 0.029100 | 0.025319 | 0.963367 | 0.962449| 0.962907 |
| 2 | 0.019900 | 0.023291 | 0.964567 | 0.968474| 0.966517 |
| 3 | 0.015400 | 0.018993 | 0.969536 | 0.967555| 0.968544 |
| 4 | 0.012700 | 0.017730 | 0.971919 | 0.969768| 0.970842 |
| 5 | 0.011100 | 0.018095 | 0.973056 | 0.970075| 0.971563 |

- **Test Metrics:**
  - **Precision:** 0.9760
  - **Recall:** 0.9732
  - **F1 Score:** 0.9746
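
The test F1 score is consistent with the harmonic mean of the reported precision and recall. A minimal sketch of that check, assuming the three numbers are overall (micro-averaged) scores; the variable names are illustrative:

```python
# Overall test precision and recall as reported in this card.
precision = 0.9760
recall = 0.9732

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```
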
## Detailed Test Classification Report
| Entity | Precision | Recall | F1-score | Support |
|---------------------|-----------|--------|----------|---------|
| AGE | 0.98 | 0.98 | 0.98 | 509 |
| BUILDINGNUM | 0.97 | 0.75 | 0.85 | 1285 |
| CITY | 1.00 | 1.00 | 1.00 | 2100 |
| CREDITCARDNUMBER | 0.99 | 0.98 | 0.99 | 249 |
| DATE | 0.85 | 0.92 | 0.88 | 1576 |
| DRIVERLICENSENUM | 0.98 | 0.98 | 0.98 | 258 |
| EMAIL | 0.98 | 1.00 | 0.99 | 1485 |
| GIVENNAME | 0.99 | 1.00 | 0.99 | 9926 |
| IDCARDNUM | 0.99 | 0.99 | 0.99 | 1174 |
| PASSPORTNUM | 0.99 | 0.99 | 0.99 | 426 |
| STREET | 0.94 | 0.98 | 0.96 | 1480 |
| SURNAME | 1.00 | 1.00 | 1.00 | 3357 |
| TAXNUM | 0.99 | 1.00 | 0.99 | 240 |
| TELEPHONENUM | 0.97 | 0.95 | 0.96 | 2175 |
| TIME | 0.96 | 0.96 | 0.96 | 2216 |
| ZIPCODE | 0.97 | 0.97 | 0.97 | 520 |
### Averages
| Metric | Precision | Recall | F1-score | Support |
|---------------|-----------|--------|----------|---------|
| **Micro avg** | 0.98 | 0.97 | 0.97 | 28976 |
| **Macro avg** | 0.97 | 0.96 | 0.97 | 28976 |
| **Weighted avg** | 0.98 | 0.97 | 0.97 | 28976 |
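
The macro and weighted averages can be approximately recomputed from the per-entity rows. The sketch below copies the rounded values from the classification report, so its results may differ from the table in the last digit; the helper names `macro_avg` and `weighted_avg` are illustrative and not part of any library.

```python
# (entity, precision, recall, f1, support) copied from the classification report.
report = [
    ("AGE", 0.98, 0.98, 0.98, 509),
    ("BUILDINGNUM", 0.97, 0.75, 0.85, 1285),
    ("CITY", 1.00, 1.00, 1.00, 2100),
    ("CREDITCARDNUMBER", 0.99, 0.98, 0.99, 249),
    ("DATE", 0.85, 0.92, 0.88, 1576),
    ("DRIVERLICENSENUM", 0.98, 0.98, 0.98, 258),
    ("EMAIL", 0.98, 1.00, 0.99, 1485),
    ("GIVENNAME", 0.99, 1.00, 0.99, 9926),
    ("IDCARDNUM", 0.99, 0.99, 0.99, 1174),
    ("PASSPORTNUM", 0.99, 0.99, 0.99, 426),
    ("STREET", 0.94, 0.98, 0.96, 1480),
    ("SURNAME", 1.00, 1.00, 1.00, 3357),
    ("TAXNUM", 0.99, 1.00, 0.99, 240),
    ("TELEPHONENUM", 0.97, 0.95, 0.96, 2175),
    ("TIME", 0.96, 0.96, 0.96, 2216),
    ("ZIPCODE", 0.97, 0.97, 0.97, 520),
]

total_support = sum(row[4] for row in report)


def macro_avg(col: int) -> float:
    """Unweighted mean over entity types: every type counts equally."""
    return sum(row[col] for row in report) / len(report)


def weighted_avg(col: int) -> float:
    """Mean over entity types weighted by support (number of test entities)."""
    return sum(row[col] * row[4] for row in report) / total_support


for name, col in (("precision", 1), ("recall", 2), ("f1", 3)):
    print(f"{name:9s} macro={macro_avg(col):.3f} weighted={weighted_avg(col):.3f}")
```

The macro average weights every entity type equally, so the lower BUILDINGNUM recall (0.75) pulls macro recall down more than it affects the weighted average, which is dominated by high-support types such as GIVENNAME.
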
## Entities the Model Can Recognize
```python
[
"AGE",
"BUILDINGNUM",
"CITY",
"CREDITCARDNUMBER",
"DATE",
"DRIVERLICENSENUM",
"EMAIL",
"GIVENNAME",
"IDCARDNUM",
"PASSPORTNUM",
"STREET",
"SURNAME",
"TAXNUM",
"TELEPHONENUM",
"TIME",
"ZIPCODE"
]
```
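
Each entity type is predicted with BIO tags: `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity. The sketch below rebuilds the 33-label layout used by the `id_to_label` dictionary in the Usage section (`O` at id 0, `B-` tags at ids 1 to 16, `I-` tags at ids 17 to 32). Whether the model's own config stores the same mapping is an assumption worth checking before relying on it.

```python
entities = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GIVENNAME", "IDCARDNUM", "PASSPORTNUM",
    "STREET", "SURNAME", "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]

# "O" first, then all B- tags, then all I- tags, in the same entity order.
labels = ["O"] + [f"B-{e}" for e in entities] + [f"I-{e}" for e in entities]
id_to_label = dict(enumerate(labels))
label_to_id = {label: idx for idx, label in id_to_label.items()}

print(len(id_to_label), id_to_label[1], id_to_label[17])
```
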
## Usage
To use the model for PII detection and anonymization:
The model is trained on lowercase text. The code below lowercases the input automatically; if you write custom inference code, lowercase the input yourself.
```python
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
from typing import List, Dict, Tuple


class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # BIO label map: "O", then B- tags (1-16), then I- tags (17-32).
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME",
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }

        # Human-readable names for each entity type.
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }

    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        # The model expects lowercase input; offsets below refer to the lowercased text.
        text = text.lower()
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True
        )
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
        predictions = outputs.logits.argmax(dim=2)
        predictions = predictions[0].cpu().numpy()

        entities = []
        current_entity = None
        for idx, (offset, pred_id) in enumerate(zip(offset_mapping, predictions)):
            # Special and padding tokens have an empty (0, 0) offset; skip them.
            if offset[0] == 0 and offset[1] == 0:
                continue
            pred_label = self.id_to_label[pred_id]
            if pred_label.startswith("B-"):
                # A new entity starts; close the previous one if it is still open.
                if current_entity:
                    entities.append(current_entity)
                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": int(offset[0]),
                    "end": int(offset[1]),
                    "value": text[offset[0]:offset[1]]
                }
            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]
                if entity_type == current_entity["label"]:
                    # Extend the open entity to cover this token.
                    current_entity["end"] = int(offset[1])
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    # Mismatched I- tag: close the open entity.
                    entities.append(current_entity)
                    current_entity = None
            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None

        if current_entity:
            entities.append(current_entity)
        return entities

    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)
        if not entities:
            return text, []
        # Replace from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]
        entities.sort(key=lambda x: x["start"])
        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)
        if not entities:
            return text
        # Insert markers from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)
        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]
            highlighted_text = (
                highlighted_text[:start] +
                f"[{entity_type}: {entity_value}]" +
                highlighted_text[end:]
            )
        return highlighted_text


if __name__ == "__main__":
    ner = AzerbaijaniNER()
    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""

    print("=== Original Text ===")
    print(test_text)

    print("\n=== Found Entities ===")
    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")

    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)

    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
```
```
=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)
=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
```
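
One caveat with lowercasing: the entity offsets refer to the lowercased text, while `anonymize_text` and `highlight_entities` apply them to the original text. In Python, `str.lower()` turns the capital dotted `İ` into two code points (`i` plus a combining dot), so the lowercased string can be longer than the original and the offsets can drift. The sketch below is one possible way to handle this, not part of the model's code: it lowercases character by character and records, for every lowercased character, the index of the original character it came from. The function names `lower_with_offset_map` and `to_original_span` are hypothetical.

```python
from typing import List, Tuple


def lower_with_offset_map(text: str) -> Tuple[str, List[int]]:
    """Lowercase `text` and return (lowered, orig_index), where orig_index[i]
    is the index in `text` of the character that produced lowered[i]."""
    lowered_parts = []
    orig_index = []
    for i, ch in enumerate(text):
        low = ch.lower()  # may expand to more than one code point, e.g. for "İ"
        lowered_parts.append(low)
        orig_index.extend([i] * len(low))
    return "".join(lowered_parts), orig_index


def to_original_span(start: int, end: int, orig_index: List[int]) -> Tuple[int, int]:
    """Map a non-empty [start, end) span in the lowered text to the original text."""
    return orig_index[start], orig_index[end - 1] + 1


text = "İlham İstanbulda"
lowered, orig_index = lower_with_offset_map(text)

# Suppose an entity was found at characters 7..18 of the lowered text.
start, end = to_original_span(7, 18, orig_index)
print(len(text), len(lowered), repr(text[start:end]))
```

With a mapping like this, spans returned for the lowercased input could be converted before slicing or masking the original text. For inputs without such characters the two strings have the same length and the mapping is the identity.
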
## CC BY 4.0 License — What It Allows
The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license allows:
### ✅ You Can:
- **Use** the model for any purpose, including commercial use.
- **Share** it — copy and redistribute in any medium or format.
- **Adapt** it — remix, transform, and build upon it for any purpose, even commercially.
### 📝 You Must:
- **Give appropriate credit** — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
- **Not imply endorsement** — Do not suggest the original author endorses you or your use.
### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).
### Summary:
You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.
For more information, please refer to the CC BY 4.0 license.
## Contact
For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.