Spam Email Classifier - RoBERTa-base with LoRA (r=4)

This model is a LoRA adapter for spam email classification, fine-tuned on the Email Spam Classification Dataset with 83,448 emails.

Model Description

  • Base Model: FacebookAI/roberta-base
  • LoRA Rank: 4
  • LoRA Alpha: 8
  • Task: Binary Text Classification (Spam/Ham)
  • Training Dataset: 83,448 emails (66,758 training samples)
  • Trainable Parameters: 1,255,682 (1.00% of total)
  • Total Parameters: 125,902,852
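
The parameter counts above can be verified directly after loading the adapter. A minimal sketch using peft (note that PeftModel.from_pretrained freezes the adapter by default, so is_trainable=True is passed here to keep the adapter weights counted as trainable):

from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=2
)
model = PeftModel.from_pretrained(
    base,
    "ssheroz/spam-email-classifier-roberta-r4",
    is_trainable=True,  # keep adapter weights unfrozen so they are counted
)

# Should report roughly 1,255,682 trainable of 125,902,852 total parameters
model.print_trainable_parameters()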

Performance

Metric     Score
---------  ------
Accuracy   99.41%
Precision  99.50%
Recall     99.39%
F1 Score   99.44%
ROC-AUC    0.9990
PR-AUC     0.9988

Training Time: 544.60 minutes (~9.1 hours)
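
Metrics of this kind can be recomputed on the held-out test split with scikit-learn. A sketch, assuming y_true holds the gold labels (1 = spam) and y_prob the model's predicted spam probabilities; the tiny arrays below are placeholders, not the actual test data:

import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
)

y_true = np.array([1, 0, 1, 0])              # placeholder gold labels
y_prob = np.array([0.98, 0.02, 0.91, 0.10])  # placeholder spam probabilities
y_pred = (y_prob >= 0.5).astype(int)         # default 0.5 decision threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))            # threshold-free
print("PR-AUC   :", average_precision_score(y_true, y_prob))  # threshold-free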

Usage

Method 1: Using the Inference Script (Recommended)

Download the inference script and config from the GitHub repository:

# Download inference files
wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference.py
wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference_config.yaml

# Update inference_config.yaml with this model:
# base_model_name: "FacebookAI/roberta-base"
# adapter_path: "ssheroz/spam-email-classifier-roberta-r4"

Python API:

from inference import SpamClassifier

# Initialize classifier
classifier = SpamClassifier(config_path="inference_config.yaml")

# Classify single email
email = "Subject: URGENT! You've won $1,000,000! Click here to claim now!"
result = classifier.predict_single(email)

print(f"Prediction: {result['label']}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"Probabilities: {result['probabilities']}")

Command Line:

# Single email prediction
python inference.py --text "Subject: Meeting tomorrow at 2pm"

# Batch prediction from CSV
python inference.py --input_file emails.csv --output_file predictions.csv
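
The column layout that inference.py expects in emails.csv is defined by the script itself; purely as an illustration, a file with one email per row under a single text column could be prepared like this (the column name "text" is an assumption, not the script's documented interface):

import pandas as pd

emails = pd.DataFrame({
    "text": [  # hypothetical column name; check inference.py for the real one
        "Subject: Meeting tomorrow at 2pm",
        "Subject: URGENT! You've won $1,000,000! Click here to claim now!",
    ]
})
emails.to_csv("emails.csv", index=False)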

Method 2: Direct Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "FacebookAI/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    problem_type="single_label_classification"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "ssheroz/spam-email-classifier-roberta-r4")
model.eval()

# Inference
text = "Subject: URGENT! You've won $1,000,000! Click here now!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()

label = "SPAM" if prediction == 1 else "HAM"
confidence = probabilities[0][prediction].item()

print(f"Prediction: {label} (Confidence: {confidence:.2%})")

Training Details

Hyperparameters

  • Epochs: 2
  • Learning Rate: 2e-4
  • Batch Size: 16
  • Optimizer: AdamW with weight decay (0.01)
  • Scheduler: Cosine with warmup (10% warmup ratio)
  • Gradient Clipping: 1.0
  • Mixed Precision: FP16
  • Early Stopping: Patience=2
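
These settings map naturally onto transformers.TrainingArguments; the following is a sketch under that assumption, not the author's exact training script:

from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="spam-classifier-lora",  # hypothetical output directory
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    weight_decay=0.01,            # AdamW is the Trainer's default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_grad_norm=1.0,            # gradient clipping
    fp16=True,                    # mixed precision
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required by EarlyStoppingCallback
)

# Early stopping with patience=2, passed to the Trainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)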

LoRA Configuration

  • Rank (r): 4
  • Alpha: 8
  • Dropout: 0.1
  • Target Modules: query, key, value, dense (all attention layers)
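
In peft terms this corresponds to a LoraConfig along the following lines; a sketch, noting that peft matches target_modules by module-name suffix and that the SEQ_CLS task type additionally keeps the classification head fully trainable (consistent with the ~1.26M trainable parameters reported above):

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # also marks the classifier head as trainable
    r=4,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],  # matched by suffix
)

# base_model as loaded in the "Direct Usage" section above
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()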

Data Split

  • Train: 66,758 samples (80%)
  • Validation: 8,345 samples (10%)
  • Test: 8,345 samples (10%)
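
A split like this can be produced with two stratified calls to scikit-learn's train_test_split; a sketch, assuming the full dataset lives in a DataFrame df with a label column (stratification and the seed are assumptions, not documented choices):

from sklearn.model_selection import train_test_split

# df is a placeholder for the full 83,448-email dataset
train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42
)
# -> 66,758 train / 8,345 validation / 8,345 test (up to rounding)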

Limitations

  • Trained primarily on English emails
  • Performance may degrade on domain-specific spam (e.g., social media, SMS)
  • Requires periodic retraining for evolving spam patterns
  • False positives (legitimate emails marked as spam) can occur with unusual email patterns

Ethical Considerations

  • False positives may cause users to miss important emails
  • Should be used as part of a larger system with human oversight for critical applications
  • Regular monitoring and updates recommended to maintain effectiveness

Citation

If you use this model, please cite:

@misc{shaikh2025spamclassifier,
  author = {Sheroz Shaikh},
  title = {Spam Email Classification using LoRA Fine-tuned Transformers},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssheroz/spam-email-classifier-roberta-r4}}
}

GitHub Repository

Full training code, analysis, and inference scripts: https://github.com/sherozshaikh/spam-email-classification-lora

License

MIT License - See LICENSE for details.
