---
license: apache-2.0
language:
- en
base_model:
- distilbert/distilbert-base-uncased
tags:
- pii-detection
- ner
- finance
- legal
- compliance
- privacy
---

# DistilBERT for PII Detection

This model is a fine-tuned **DistilBERT** (`distilbert-base-uncased`) for **Named Entity Recognition (NER)**, specifically designed to detect **Personally Identifiable Information (PII)** in English text.

It was trained on a custom dataset of **4138 samples** with **18 entity classes** relevant to **compliance, finance, and legal text redaction**.

---

## Model Description

- **Developed by:** Independent (2025)
- **Model type:** Token classification (NER)
- **Language(s):** English
- **License:** Apache-2.0
- **Fine-tuned from:** [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Parameters:** ~66M

The model identifies PII entities such as names, email addresses, phone numbers, financial amounts, dates, and credentials, making it suitable for **document redaction** and **compliance automation** (GDPR, HIPAA, PCI-DSS).

---

## Entity Classes

The model supports **18 entity classes** (plus `O` for non-entity tokens):

| Entity | Description |
|-----------------|-------------|
| `AMOUNT` | Monetary values, amounts, percentages |
| `COUNTRY` | Country names |
| `CREDENTIALS` | Passwords, access keys, or secret tokens |
| `DATE` | Calendar dates |
| `EMAIL` | Email addresses |
| `EXPIRYDATE` | Expiry dates (e.g., card expiry) |
| `FIRSTNAME` | First names |
| `IPADDRESS` | IPv4 or IPv6 addresses |
| `LASTNAME` | Last names |
| `LOCATION` | General locations (cities, regions, etc.) |
| `MACADDRESS` | MAC addresses |
| `NUMBER` | Generic numeric identifiers |
| `ORGANIZATION` | Company or institution names |
| `PERCENT` | Percentages |
| `PHONE` | Phone numbers |
| `TIME` | Time expressions (HH:MM, AM/PM, etc.) |
| `UID` | Unique IDs (customer IDs, transaction IDs, etc.) |
| `ZIPCODE` | Postal/ZIP codes |

---

## Uses

### Direct Use

- Detect and **mask/redact PII** in unstructured text.
- **Document anonymization** for legal, financial, and healthcare records.
- Compliance automation for **GDPR, HIPAA, PCI-DSS**.

### Downstream Use

- Integrate into **ETL pipelines**, **chatbots**, or **audit workflows**.
- Extend with **multi-language fine-tuning** for broader use cases.

### Out-of-Scope Use

- Should not be used as the **sole compliance system** without human validation.
- Not designed for **languages other than English**.
- May misclassify entities in noisy, slang-heavy, or highly domain-specific text.

---

## Training Details

### Compute

- **Hardware:** Google Colab Pro | GPU: Tesla T4 (15.8 GB VRAM)
- **Frameworks:** PyTorch + Hugging Face Transformers v4.56.1
- **Training time:** ~1.5 hours

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("narayan214/distilbert-pii-before-v2")
model = AutoModelForTokenClassification.from_pretrained("narayan214/distilbert-pii-before-v2")

# aggregation_strategy="simple" merges sub-word tokens into whole-entity spans
pii_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "John Doe's email is john.doe@example.com and his phone number is +1-202-555-0173."
print(pii_pipeline(text))
```

---

## Citation

If you use this model, please cite the original DistilBERT paper:

**BibTeX:**

```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
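For the redaction use case mentioned above, the pipeline's character offsets (`start`/`end`) can be used to mask detected spans. The sketch below is illustrative only: `redact` is a hypothetical helper, and `entities` mimics the shape of the `"ner"` pipeline output with `aggregation_strategy="simple"` rather than being actual model output.

```python
def redact(text, entities):
    """Replace each detected span with its [LABEL].

    Spans are processed right to left so that earlier
    character offsets remain valid after each substitution.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Hypothetical pipeline output for the sentence below.
text = "John Doe's email is john.doe@example.com."
entities = [
    {"entity_group": "FIRSTNAME", "start": 0, "end": 4},
    {"entity_group": "LASTNAME", "start": 5, "end": 8},
    {"entity_group": "EMAIL", "start": 20, "end": 40},
]

print(redact(text, entities))
# [FIRSTNAME] [LASTNAME]'s email is [EMAIL].
```

In a real workflow, pass `pii_pipeline(text)` directly as `entities`; for compliance use, keep a human review step as noted under Out-of-Scope Use.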