DistilBERT for PII Detection

This model is a fine-tuned DistilBERT (distilbert-base-uncased) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in English text.
It was trained on a custom dataset of 4138 samples with 18 entity classes relevant to compliance, finance, and legal text redaction.


Model Description

  • Developed by: Independent (2025)
  • Model type: Token classification (NER)
  • Language(s): English
  • License: Apache-2.0
  • Fine-tuned from: distilbert-base-uncased
  • Parameters: ~66M

The model identifies PII entities such as names, email addresses, phone numbers, financial amounts, dates, and credentials, making it suitable for document redaction and compliance automation (GDPR, HIPAA, PCI-DSS).


Entity Classes

The model supports 18 entity classes (plus O for non-entity tokens); a short snippet after the table shows how to read the exact label names from the model config:

Entity         Description
AMOUNT         Monetary values, amounts, percentages
COUNTRY        Country names
CREDENTIALS    Passwords, access keys, or secret tokens
DATE           Calendar dates
EMAIL          Email addresses
EXPIRYDATE     Expiry dates (e.g., card expiry)
FIRSTNAME      First names
IPADDRESS      IPv4 or IPv6 addresses
LASTNAME       Last names
LOCATION       General locations (cities, regions, etc.)
MACADDRESS     MAC addresses
NUMBER         Generic numeric identifiers
ORGANIZATION   Company or institution names
PERCENT        Percentages
PHONE          Phone numbers
TIME           Time expressions (HH:MM, AM/PM, etc.)
UID            Unique IDs (customer IDs, transaction IDs, etc.)
ZIPCODE        Postal/ZIP codes
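
The label names stored in the checkpoint (for example, whether they carry B-/I- prefixes) can be listed directly from the model configuration. A minimal sketch, assuming only the repository ID shown in the How to Use section:

from transformers import AutoConfig

# id2label maps class indices to the entity names listed above (plus "O").
config = AutoConfig.from_pretrained("narayan214/distilbert-pii-before-v2")
for idx, label in sorted(config.id2label.items()):
    print(idx, label)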

Uses

Direct Use

  • Detect and mask/redact PII in unstructured text.
  • Document anonymization for legal, financial, and healthcare records.
  • Compliance automation for GDPR, HIPAA, PCI-DSS.

Downstream Use

  • Integrate into ETL pipelines, chatbots, or audit workflows (see the batch-processing sketch after this list).
  • Extend with multi-language fine-tuning for broader use cases.
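
As a sketch of the ETL-style batch usage mentioned above: Hugging Face pipelines accept a list of strings and return one list of detected entities per input. The example records below are illustrative.

from transformers import pipeline

# Build the NER pipeline directly from the Hub repository ID.
pii_pipeline = pipeline(
    "ner",
    model="narayan214/distilbert-pii-before-v2",
    aggregation_strategy="simple",
)

# Hypothetical batch step: records pulled from a database or CSV export.
records = [
    "Invoice 4821 was paid by Jane Smith on 2024-03-15.",
    "Contact support at [email protected] or +44 20 7946 0958.",
]

for record, entities in zip(records, pii_pipeline(records, batch_size=8)):
    print(record)
    for ent in entities:
        print(f"  {ent['entity_group']}: {ent['word']} (score={ent['score']:.2f})")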

Out-of-Scope Use

  • Should not be the sole compliance system without human validation.
  • Not designed for languages other than English.
  • May misclassify in noisy, slang-heavy, or highly domain-specific text.

Compute

  • Hardware: Google Colab Pro | GPU: Tesla T4 (15.8 GB VRAM)
  • Frameworks: PyTorch + Hugging Face Transformers v4.56.1
  • Training Time: ~1.5 hours

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("narayan214/distilbert-pii-before-v2")
model = AutoModelForTokenClassification.from_pretrained("narayan214/distilbert-pii-before-v2")

# "simple" aggregation merges sub-word tokens into whole entity spans.
pii_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "John Doe's email is [email protected] and his phone number is +1-202-555-0173."
print(pii_pipeline(text))
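
Continuing from the snippet above, a minimal redaction sketch that masks every detected span using the character offsets returned by the pipeline (the [LABEL] placeholder format is illustrative):

# Replace each detected span with a [LABEL] placeholder.
# Spans are processed right-to-left so earlier replacements don't shift later offsets.
entities = sorted(pii_pipeline(text), key=lambda ent: ent["start"], reverse=True)
redacted = text
for ent in entities:
    redacted = redacted[:ent["start"]] + f"[{ent['entity_group']}]" + redacted[ent["end"]:]
print(redacted)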

Citation

If you use this model, please cite the original DistilBERT paper:

BibTeX:

@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}