# Spam Email Classifier - RoBERTa-base with LoRA (r=4)
This model is a LoRA adapter for spam email classification, fine-tuned on the Email Spam Classification Dataset with 83,448 emails.
## Model Description
- Base Model: FacebookAI/roberta-base
- LoRA Rank: 4
- LoRA Alpha: 8
- Task: Binary Text Classification (Spam/Ham)
- Training Dataset: 83,448 emails (66,758 training samples)
- Trainable Parameters: 1,255,682 (1.00% of total)
- Total Parameters: 125,902,852
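The parameter counts above can be checked directly with `peft`; a minimal sketch (loading the adapter with `is_trainable=True` so the LoRA weights are counted as trainable rather than frozen for inference):

```python
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base", num_labels=2
)
# is_trainable=True loads the adapter unfrozen instead of in inference mode
model = PeftModel.from_pretrained(
    base, "ssheroz/spam-email-classifier-roberta-r4", is_trainable=True
)
# Prints something along the lines of:
# trainable params: 1,255,682 || all params: 125,902,852 || trainable%: 1.00
model.print_trainable_parameters()
```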
## Performance
| Metric | Score |
|---|---|
| Accuracy | 99.41% |
| Precision | 99.50% |
| Recall | 99.39% |
| F1 Score | 99.44% |
| ROC-AUC | 0.9990 |
| PR-AUC | 0.9988 |
**Training Time:** 544.60 minutes (~9.1 hours)
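These scores can be reproduced from test-set outputs with `scikit-learn`; a minimal sketch, using small hypothetical stand-in arrays in place of the real test set:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
)

# Hypothetical stand-ins: 0/1 true labels and predicted spam probabilities
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.02, 0.97, 0.88, 0.10, 0.64])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))            # uses probabilities
print("PR-AUC   :", average_precision_score(y_true, y_prob))  # uses probabilities
```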
## Usage
### Method 1: Using the Inference Script (Recommended)
Download the inference script and config from the GitHub repository:

```bash
# Download inference files
wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference.py
wget https://raw.githubusercontent.com/sherozshaikh/spam-email-classification-lora/main/inference/inference_config.yaml
```

Then point `inference_config.yaml` at this model:

```yaml
base_model_name: "FacebookAI/roberta-base"
adapter_path: "ssheroz/spam-email-classifier-roberta-r4"
```
**Python API:**
```python
from inference import SpamClassifier

# Initialize classifier
classifier = SpamClassifier(config_path="inference_config.yaml")

# Classify a single email
email = "Subject: URGENT! You've won $1,000,000! Click here to claim now!"
result = classifier.predict_single(email)

print(f"Prediction: {result['label']}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"Probabilities: {result['probabilities']}")
```
**Command Line:**
```bash
# Single email prediction
python inference.py --text "Subject: Meeting tomorrow at 2pm"

# Batch prediction from CSV
python inference.py --input_file emails.csv --output_file predictions.csv
```
### Method 2: Direct Usage with Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load base model and tokenizer
base_model_name = "FacebookAI/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=2,
    problem_type="single_label_classification",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "ssheroz/spam-email-classifier-roberta-r4")
model.eval()

# Inference
text = "Subject: URGENT! You've won $1,000,000! Click here now!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()

label = "SPAM" if prediction == 1 else "HAM"
confidence = probabilities[0][prediction].item()
print(f"Prediction: {label} (Confidence: {confidence:.2%})")
```
## Training Details
### Hyperparameters
- Epochs: 2
- Learning Rate: 2e-4
- Batch Size: 16
- Optimizer: AdamW with weight decay (0.01)
- Scheduler: Cosine with warmup (10% warmup ratio)
- Gradient Clipping: 1.0
- Mixed Precision: FP16
- Early Stopping: Patience=2
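A `transformers` `Trainer` setup mirroring these hyperparameters might look like the sketch below; the output path and the dataset objects are hypothetical placeholders, not taken from the repo:

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="spam-roberta-lora-r4",  # hypothetical path
    num_train_epochs=2,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,                  # AdamW is the Trainer's default optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # 10% warmup
    max_grad_norm=1.0,                  # gradient clipping
    fp16=True,                          # mixed precision
    eval_strategy="epoch",              # `evaluation_strategy` on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                        # the PEFT-wrapped model
    args=args,
    train_dataset=train_ds,             # hypothetical tokenized datasets
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```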
### LoRA Configuration
- Rank (r): 4
- Alpha: 8
- Dropout: 0.1
- Target Modules: query, key, value, dense (the attention projections; the `dense` pattern also matches the feed-forward dense layers by name)
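A `peft` configuration matching these settings might look like the following sketch; with `task_type=TaskType.SEQ_CLS`, peft also keeps the classification head trainable, which together with the LoRA matrices is consistent with the 1,255,682 trainable parameters reported above:

```python
from peft import LoraConfig, TaskType, get_peft_model

# Sketch: LoraConfig mirroring the settings listed above
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # also marks the classification head trainable
    r=4,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["query", "key", "value", "dense"],
)

# base_model loaded as in Method 2 above
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```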
### Data Split
- Train: 66,758 samples (80%)
- Validation: 8,345 samples (10%)
- Test: 8,345 samples (10%)
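These counts are consistent with a stratified 80/10/10 split; a sketch of how such a split might be produced (`texts` and `labels` are hypothetical placeholders for the 83,448 emails and their 0/1 labels, and the repo's exact procedure may differ):

```python
from sklearn.model_selection import train_test_split

# 80% train, 20% held out
train_texts, rest_texts, train_labels, rest_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42
)
# Split the held-out 20% evenly into validation and test
val_texts, test_texts, val_labels, test_labels = train_test_split(
    rest_texts, rest_labels, test_size=0.50, stratify=rest_labels, random_state=42
)
# 83,448 * 0.8 = 66,758 train; the remaining 16,690 split into 8,345 + 8,345
```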
## Limitations
- Trained primarily on English emails
- Performance may degrade on out-of-domain text such as social-media messages or SMS spam
- Requires periodic retraining for evolving spam patterns
- False positives (legitimate emails marked as spam) can occur with unusual email patterns
## Ethical Considerations
- False positives may cause users to miss important emails
- Should be used as part of a larger system with human oversight for critical applications
- Regular monitoring and updates recommended to maintain effectiveness
## Citation
If you use this model, please cite:
```bibtex
@misc{shaikh2025spamclassifier,
  author       = {Sheroz Shaikh},
  title        = {Spam Email Classification using LoRA Fine-tuned Transformers},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ssheroz/spam-email-classifier-roberta-r4}}
}
```
## GitHub Repository
Full training code, analysis, and inference scripts: [spam-email-classification-lora](https://github.com/sherozshaikh/spam-email-classification-lora)
## License
MIT License - See LICENSE for details.
## Contact
- GitHub: [@sherozshaikh](https://github.com/sherozshaikh)
- HuggingFace: [@ssheroz](https://huggingface.co/ssheroz)