Twitter Bot Detection Model

Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

Model Version: v2
Training Date: 2025-11-27 12:08:54
Framework: scikit-learn 1.5.2
Algorithm: Random Forest Classifier with GridSearchCV hyperparameter tuning


📊 Model Performance

Final Metrics (Test Set)

Metric              Score
Accuracy            0.8771 (87.71%)
Precision           0.8595 (85.95%)
Recall              0.7558 (75.58%)
F1-Score            0.8043 (80.43%)
ROC-AUC             0.9354 (93.54%)
Average Precision   0.9008 (90.08%)
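
The scores above can be recomputed from the saved model with scikit-learn's metric functions. A minimal sketch, assuming X_test_scaled and y_test hold the held-out features (already MinMax-scaled) and labels:

import joblib
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

model = joblib.load('twitter_bot_detection_v2.pkl')

# X_test_scaled / y_test are assumed to exist: scaled test features and labels (0 = human, 1 = bot)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]   # probability of the bot class

print(f"Accuracy          : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision         : {precision_score(y_test, y_pred):.4f}")
print(f"Recall            : {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score          : {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC           : {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision : {average_precision_score(y_test, y_prob):.4f}")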

Model Improvement

  • Baseline ROC-AUC: 0.9314
  • Tuned ROC-AUC: 0.9354
  • Improvement: 0.0040 (0.43%)

πŸ—‚οΈ Files

File                           Description
twitter_bot_detection_v2.pkl   Trained Random Forest model
twitter_scaler_v2.pkl          MinMaxScaler for feature normalization
twitter_features_v2.json       List of features used by the model
twitter_metrics_v2.txt         Detailed performance metrics report
images/                        All visualization plots (13 images)
README.md                      This file
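
To keep inference inputs in the same column order as training, the feature list file can be loaded first. A small sketch, assuming twitter_features_v2.json contains a plain JSON array of the 12 column names:

import json
import pandas as pd

with open('twitter_features_v2.json') as f:
    feature_names = json.load(f)   # assumed format: ["has_custom_cover_image", ...]

def order_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Reorder (and implicitly validate) incoming columns to match training
    return df[feature_names]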

🎯 Dataset Information

Training Configuration

  • Training Samples: 29,951
  • Test Samples: 7,487
  • Total Samples: 37,438
  • Number of Features: 12
  • Cross-Validation Folds: 5
  • Random State: 42
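
The train/test sizes above correspond to roughly an 80/20 stratified split. A minimal sketch of how such a split could be produced, assuming X and y hold the full feature matrix and labels:

from sklearn.model_selection import train_test_split

# ~80/20 split of 37,438 samples, stratified to preserve the human/bot ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)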

Class Distribution

Training Set:

  • Human (0): 20,028 (66.87%)
  • Bot (1): 9,923 (33.13%)

Test Set:

  • Human (0): 4,985 (66.58%)
  • Bot (1): 2,502 (33.42%)
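
These proportions can be checked directly from the label columns; a one-line sanity check, assuming y_train and y_test are pandas Series of 0/1 labels:

print(y_train.value_counts(normalize=True))   # expect ~0.67 human / ~0.33 bot
print(y_test.value_counts(normalize=True))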

🔧 Features (12)

  1. has_custom_cover_image
  2. description_length
  3. favourites_count
  4. followers_count
  5. friends_count
  6. followers_to_friends_ratio
  7. has_location
  8. username_digit_count
  9. username_length
  10. statuses_count
  11. is_verified
  12. account_age_days
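
Several of these features are simple transformations of raw profile fields (lengths, ratios, digit counts, boolean flags). A sketch of how the derived values could be computed; the raw field names used here are illustrative assumptions, not the project's actual schema:

def derive_features(profile: dict) -> dict:
    """Map a raw profile record to the 12 model features (raw field names are assumptions)."""
    username = profile.get('screen_name', '')
    followers = profile.get('followers_count', 0)
    friends = profile.get('friends_count', 0)
    return {
        'has_custom_cover_image': int(bool(profile.get('profile_banner_url'))),
        'description_length': len(profile.get('description') or ''),
        'favourites_count': profile.get('favourites_count', 0),
        'followers_count': followers,
        'friends_count': friends,
        'followers_to_friends_ratio': followers / friends if friends else 0.0,
        'has_location': int(bool(profile.get('location'))),
        'username_digit_count': sum(ch.isdigit() for ch in username),
        'username_length': len(username),
        'statuses_count': profile.get('statuses_count', 0),
        'is_verified': int(bool(profile.get('verified'))),
        'account_age_days': profile.get('account_age_days', 0),
    }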

πŸ† Top 5 Most Important Features

  1. followers_count - 0.1895
  2. favourites_count - 0.1813
  3. friends_count - 0.1494
  4. statuses_count - 0.1244
  5. account_age_days - 0.1010
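
The full ranking can be read back from the saved model via its feature_importances_ attribute (impurity-based by default). A short sketch, assuming feature_names is the list loaded from twitter_features_v2.json:

import joblib

model = joblib.load('twitter_bot_detection_v2.pkl')

# Pair each feature with its importance and sort descending
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:30s} {score:.4f}")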

βš™οΈ Hyperparameters

Best Parameters (from GridSearchCV)

  • class_weight: balanced
  • max_depth: 20
  • max_features: sqrt
  • min_samples_leaf: 1
  • min_samples_split: 2
  • n_estimators: 300

Parameter Search Space

  • n_estimators: [100, 200, 300]
  • max_depth: [10, 15, 20, None]
  • min_samples_split: [2, 5, 10]
  • min_samples_leaf: [1, 2, 4]
  • max_features: ['sqrt', 'log2']
  • bootstrap: [True, False]

Total combinations tested: 540
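
A minimal sketch of the tuning setup implied by the search space above. The scoring metric and the handling of class_weight (which appears only in the best parameters) are assumptions, so this reproduces the spirit of the search rather than the exact run:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'class_weight': ['balanced'],   # assumption: fixed rather than part of the original grid
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',              # assumption: ROC-AUC as the selection metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
grid.fit(X_train_scaled, y_train)   # assumes the scaled training data from the preprocessing notes
print(grid.best_params_)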


📈 Cross-Validation Results

Mean Scores (5-Fold Stratified CV)

  • Accuracy: 0.8750 (±0.0053)
  • Precision: 0.8658 (±0.0089)
  • Recall: 0.7368 (±0.0113)
  • F1-Score: 0.7961 (±0.0092)
  • ROC-AUC: 0.9325 (±0.0037)
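
A sketch of how these per-fold scores could be gathered with cross_validate, assuming the tuned estimator (grid.best_estimator_) and the scaled training set from the sketches above:

from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    grid.best_estimator_,
    X_train_scaled, y_train,
    cv=cv,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    fold_scores = scores[f'test_{metric}']
    print(f"{metric:10s} {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")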

πŸ–ΌοΈ Visualizations

All visualizations are saved in the images/ directory:

  1. 01_class_distribution.png - Training/Test set class distribution
  2. 02_feature_correlation.png - Feature correlation with target variable
  3. 03_correlation_matrix.png - Feature correlation heatmap
  4. 04_baseline_confusion_matrix.png - Baseline model confusion matrix
  5. 05_baseline_roc_curve.png - Baseline ROC curve
  6. 06_baseline_precision_recall.png - Baseline Precision-Recall curve
  7. 07_baseline_feature_importance.png - Baseline feature importance
  8. 08_cross_validation.png - Cross-validation score distribution
  9. 09_tuned_confusion_matrix.png - Tuned model confusion matrix
  10. 10_tuned_roc_curve.png - Tuned ROC curve
  11. 11_tuned_precision_recall.png - Tuned Precision-Recall curve
  12. 12_tuned_feature_importance.png - Tuned feature importance
  13. 13_model_comparison.png - Baseline vs Tuned comparison

🚀 Usage Example

import joblib
import pandas as pd

# Load model and scaler
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Prepare your data (illustrative raw, unscaled values for a single account)
data = {
    'has_custom_cover_image': 1,
    'description_length': 42,
    'favourites_count': 1200,
    'followers_count': 350,
    'friends_count': 400,
    'followers_to_friends_ratio': 350 / 400,
    'has_location': 1,
    'username_digit_count': 0,
    'username_length': 9,
    'statuses_count': 5400,
    'is_verified': 0,
    'account_age_days': 1500,
}

# Create a single-row DataFrame (column order must match twitter_features_v2.json)
df = pd.DataFrame([data])

# Scale features with the saved MinMaxScaler
df_scaled = scaler.transform(df)

# Predict
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")

📋 Confusion Matrix Breakdown

Tuned Model (Test Set)

                 Predicted
               Human     Bot
Actual Human    4676      309
       Bot       611     1891

  • True Negatives (TN): 4,676 (correctly identified humans)
  • False Positives (FP): 309 (humans incorrectly classified as bots)
  • False Negatives (FN): 611 (bots incorrectly classified as humans)
  • True Positives (TP): 1,891 (correctly identified bots)
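
The headline metrics follow directly from these four counts; a short arithmetic check using the values above:

tn, fp, fn, tp = 4676, 309, 611, 1891

accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # 6567 / 7487 ≈ 0.8771
precision = tp / (tp + fp)                                   # 1891 / 2200 ≈ 0.8595
recall    = tp / (tp + fn)                                   # 1891 / 2502 ≈ 0.7558
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.8043

print(accuracy, precision, recall, f1)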

πŸ” Model Interpretation

Strengths

  • High ROC-AUC (0.9354) indicates strong discrimination between bot and human accounts
  • Good precision on the bot class (0.86), with recall of 0.76
  • Stable cross-validation performance (ROC-AUC 0.9325 ±0.0037 across folds)

Key Insights

  1. Activity and audience-size features (followers, favourites, friends, statuses, account age) dominate the importance ranking
  2. GridSearchCV tuning improved ROC-AUC over the baseline by 0.0040 (a 0.43% relative gain)
  3. Test-set metrics closely track the cross-validation means, indicating the model generalizes to unseen data

πŸ“ Notes

  • Feature Scaling: All features are scaled to the [0, 1] range with MinMaxScaler
  • Missing Values: Filled with 0 during preprocessing
  • Class Balance: Imbalanced (roughly 2:1 human to bot); the tuned model uses class_weight='balanced'
  • Model Type: Random Forest ensemble, which tends to be robust to overfitting
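
A minimal sketch of the preprocessing described in these notes (zero-fill, then MinMax scaling), assuming X_train and X_test are DataFrames with the 12 feature columns and that the scaler was fit on the training split only:

from sklearn.preprocessing import MinMaxScaler

# Fill missing values with 0, as described above
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

# Fit the scaler on the training split, then apply it to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)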

🔄 Model Updates

To retrain the model:

  1. Place new training data in ../data/train_twitter.csv
  2. Run the training notebook: 5_enhanced_training.ipynb
  3. Update this README with new metrics

📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.


Generated: 2025-11-27 12:08:54
Notebook: 5_enhanced_training.ipynb
Platform: Twitter
