DistilBERT for PII Detection
This model is a fine-tuned DistilBERT (distilbert-base-uncased) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in English text.
It was trained on a custom dataset of 4138 samples with 18 entity classes relevant to compliance, finance, and legal text redaction.
Model Description
- Developed by: Independent (2025)
- Model type: Token classification (NER)
- Language(s): English
- License: Apache-2.0
- Fine-tuned from: distilbert-base-uncased
- Parameters: ~66M
The model identifies PII entities such as names, email addresses, phone numbers, financial amounts, dates, and credentials, making it suitable for document redaction and compliance automation (GDPR, HIPAA, PCI-DSS).
Entity Classes
The model supports 18 entity classes (plus O for non-entity tokens); the snippet after the table shows how to read the full label map from the checkpoint:
| Entity | Description |
|---|---|
| AMOUNT | Monetary values, amounts, percentages |
| COUNTRY | Country names |
| CREDENTIALS | Passwords, access keys, or secret tokens |
| DATE | Calendar dates |
| EMAIL | Email addresses |
| EXPIRYDATE | Expiry dates (e.g., card expiry) |
| FIRSTNAME | First names |
| IPADDRESS | IPv4 or IPv6 addresses |
| LASTNAME | Last names |
| LOCATION | General locations (cities, regions, etc.) |
| MACADDRESS | MAC addresses |
| NUMBER | Generic numeric identifiers |
| ORGANIZATION | Company or institution names |
| PERCENT | Percentages |
| PHONE | Phone numbers |
| TIME | Time expressions (HH:MM, AM/PM, etc.) |
| UID | Unique IDs (customer IDs, transaction IDs, etc.) |
| ZIPCODE | Postal/ZIP codes |
Uses
Direct Use
- Detect and mask/redact PII in unstructured text.
- Document anonymization for legal, financial, and healthcare records.
- Compliance automation for GDPR, HIPAA, PCI-DSS.
Downstream Use
- Integrate into ETL pipelines, chatbots, or audit workflows.
- Extend with multi-language fine-tuning for broader use cases.
Out-of-Scope Use
- Should not be used as the sole compliance mechanism without human validation (see the sketch after this list for one way to route low-confidence predictions to a reviewer).
- Not designed for languages other than English.
- May misclassify in noisy, slang-heavy, or highly domain-specific text.
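Since predictions should be reviewed rather than trusted blindly, one simple pattern is to auto-redact only high-confidence spans and queue the rest for a human. A minimal sketch over the pipeline's output dicts; the 0.90 threshold and the sample spans are illustrative, not values recommended by the model authors:

```python
def triage(entities, threshold=0.90):
    """Split detected spans into auto-redactable vs. needs-human-review by score."""
    auto = [e for e in entities if e["score"] >= threshold]
    review = [e for e in entities if e["score"] < threshold]
    return auto, review

# Example input in the pipeline's output shape ('entity_group', 'score', 'word', 'start', 'end')
sample = [
    {"entity_group": "EMAIL", "score": 0.99, "word": "[email protected]", "start": 20, "end": 37},
    {"entity_group": "UID", "score": 0.55, "word": "TX-4821", "start": 60, "end": 67},
]
auto, review = triage(sample)
print(len(auto), "auto-redacted;", len(review), "sent for review")
```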
Compute
- Hardware: Google Colab Pro (Tesla T4 GPU, 15.8 GB VRAM)
- Frameworks: PyTorch + Hugging Face Transformers v4.56.1
- Training Time: ~1.5 hours
How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("narayan214/distilbert-pii-before-v2")
model = AutoModelForTokenClassification.from_pretrained("narayan214/distilbert-pii-before-v2")

# "simple" aggregation merges word-piece predictions into whole entity spans
pii_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "John Doe's email is [email protected] and his phone number is +1-202-555-0173."
print(pii_pipeline(text))
```
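Each item returned by the pipeline is a dict with `entity_group`, `score`, `word`, `start`, and `end`. Continuing from the snippet above, a minimal redaction sketch built on those character offsets; the `[LABEL]` mask format is just an illustrative choice, and the exact spans found depend on the model's predictions:

```python
def redact(text, entities):
    """Replace each detected PII span with a [LABEL] placeholder, working right to
    left so earlier character offsets remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f'[{ent["entity_group"]}]' + text[ent["end"]:]
    return text

entities = pii_pipeline(text)
print(redact(text, entities))
# e.g. "[FIRSTNAME] [LASTNAME]'s email is [EMAIL] and his phone number is [PHONE]."
```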
Citation
If you use this model, please cite the original DistilBERT paper:
BibTeX:
```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```