DistilBERT for PII Detection
This model is a fine-tuned DistilBERT (distilbert-base-uncased) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in English text.
It was trained on a custom dataset of 4138 samples with 18 entity classes relevant to compliance, finance, and legal text redaction.
Model Description
- Developed by: Independent (2025)
- Model type: Token classification (NER)
- Language(s): English
- License: Apache-2.0
- Fine-tuned from: distilbert-base-uncased
- Parameters: ~66M
The model identifies PII entities such as names, email addresses, phone numbers, financial amounts, dates, and credentials, making it suitable for document redaction and compliance automation (GDPR, HIPAA, PCI-DSS).
Entity Classes
The model supports 18 entity classes (plus O for non-entity tokens); the snippet after the table shows how to read the full label map from the checkpoint:
| Entity | Description |
|---|---|
| AMOUNT | Monetary values, amounts, percentages |
| COUNTRY | Country names |
| CREDENTIALS | Passwords, access keys, or secret tokens |
| DATE | Calendar dates |
| EMAIL | Email addresses |
| EXPIRYDATE | Expiry dates (e.g., card expiry) |
| FIRSTNAME | First names |
| IPADDRESS | IPv4 or IPv6 addresses |
| LASTNAME | Last names |
| LOCATION | General locations (cities, regions, etc.) |
| MACADDRESS | MAC addresses |
| NUMBER | Generic numeric identifiers |
| ORGANIZATION | Company or institution names |
| PERCENT | Percentages |
| PHONE | Phone numbers |
| TIME | Time expressions (HH:MM, AM/PM, etc.) |
| UID | Unique IDs (customer IDs, transaction IDs, etc.) |
| ZIPCODE | Postal/ZIP codes |
Uses
Direct Use
- Detect and mask/redact PII in unstructured text.
- Document anonymization for legal, financial, and healthcare records.
- Compliance automation for GDPR, HIPAA, PCI-DSS.
Downstream Use
- Integrate into ETL pipelines, chatbots, or audit workflows.
- Extend with multi-language fine-tuning for broader use cases.
Out-of-Scope Use
- Should not be used as the sole compliance mechanism without human validation (see the sketch after this list for one way to route low-confidence predictions to a reviewer).
- Not designed for languages other than English.
- May misclassify in noisy, slang-heavy, or highly domain-specific text.
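Since predictions should be reviewed rather than trusted blindly, one simple pattern is to auto-redact only high-confidence spans and queue the rest for a human. A minimal sketch over the pipeline's output dicts; the 0.90 threshold and the sample spans are illustrative, not values recommended by the model authors:

```python
def triage(entities, threshold=0.90):
    """Split detected spans into auto-redactable vs. needs-human-review by score."""
    auto = [e for e in entities if e["score"] >= threshold]
    review = [e for e in entities if e["score"] < threshold]
    return auto, review

# Example input in the pipeline's output shape ('entity_group', 'score', 'word', 'start', 'end')
sample = [
    {"entity_group": "EMAIL", "score": 0.99, "word": "[email protected]", "start": 20, "end": 37},
    {"entity_group": "UID", "score": 0.55, "word": "TX-4821", "start": 60, "end": 67},
]
auto, review = triage(sample)
print(len(auto), "auto-redacted;", len(review), "sent for review")
```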
Compute
- Hardware: Google Colab Pro (Tesla T4 GPU, 15.8 GB VRAM)
- Frameworks: PyTorch + Hugging Face Transformers v4.56.1
- Training Time: ~1.5 hours
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("narayan214/distilbert-pii-before-v2")
model = AutoModelForTokenClassification.from_pretrained("narayan214/distilbert-pii-before-v2")

# "simple" aggregation merges word-piece predictions into whole entity spans
pii_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "John Doe's email is [email protected] and his phone number is +1-202-555-0173."
print(pii_pipeline(text))
```
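Each item returned by the pipeline is a dict with `entity_group`, `score`, `word`, `start`, and `end`. Continuing from the snippet above, a minimal redaction sketch built on those character offsets; the `[LABEL]` mask format is just an illustrative choice, and the exact spans found depend on the model's predictions:

```python
def redact(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder, working right to
    left so earlier character offsets remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f'[{ent["entity_group"]}]' + text[ent["end"]:]
    return text

entities = pii_pipeline(text)
print(redact(text, entities))
# e.g. "[FIRSTNAME] [LASTNAME]'s email is [EMAIL] and his phone number is [PHONE]."
```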
Citation
If you use this model, please cite the original DistilBERT paper:
BibTeX:
```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```