Twitter Bot Detection Model

Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

Model Version: v2
Training Date: 2025-11-27 12:08:54
Framework: scikit-learn 1.5.2
Algorithm: Random Forest Classifier with GridSearchCV hyperparameter tuning


📊 Model Performance

Final Metrics (Test Set)

Metric              Score
Accuracy            0.8771 (87.71%)
Precision           0.8595 (85.95%)
Recall              0.7558 (75.58%)
F1-Score            0.8043 (80.43%)
ROC-AUC             0.9354 (93.54%)
Average Precision   0.9008 (90.08%)
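
The scores above can be recomputed from the saved model with scikit-learn's metric functions. A minimal sketch, assuming X_test_scaled and y_test hold the held-out features (already MinMax-scaled) and labels:

import joblib
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

model = joblib.load('twitter_bot_detection_v2.pkl')

# X_test_scaled / y_test are assumed to exist: scaled test features and labels (0 = human, 1 = bot)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]   # probability of the bot class

print(f"Accuracy          : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision         : {precision_score(y_test, y_pred):.4f}")
print(f"Recall            : {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score          : {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC           : {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision : {average_precision_score(y_test, y_prob):.4f}")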

Model Improvement

  • Baseline ROC-AUC: 0.9314
  • Tuned ROC-AUC: 0.9354
  • Improvement: 0.0040 (0.43%)

πŸ—‚οΈ Files

File                           Description
twitter_bot_detection_v2.pkl   Trained Random Forest model
twitter_scaler_v2.pkl          MinMaxScaler for feature normalization
twitter_features_v2.json       List of features used by the model
twitter_metrics_v2.txt         Detailed performance metrics report
images/                        All visualization plots (13 images)
README.md                      This file
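
To keep inference inputs in the same column order as training, the feature list file can be loaded first. A small sketch, assuming twitter_features_v2.json contains a plain JSON array of the 12 column names:

import json
import pandas as pd

with open('twitter_features_v2.json') as f:
    feature_names = json.load(f)   # assumed format: ["has_custom_cover_image", ...]

def order_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Reorder (and implicitly validate) incoming columns to match training
    return df[feature_names]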

🎯 Dataset Information

Training Configuration

  • Training Samples: 29,951
  • Test Samples: 7,487
  • Total Samples: 37,438
  • Number of Features: 12
  • Cross-Validation Folds: 5
  • Random State: 42
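
The train/test sizes above correspond to roughly an 80/20 stratified split. A minimal sketch of how such a split could be produced, assuming X and y hold the full feature matrix and labels:

from sklearn.model_selection import train_test_split

# ~80/20 split of 37,438 samples, stratified to preserve the human/bot ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)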

Class Distribution

Training Set:

  • Human (0): 20,028 (66.87%)
  • Bot (1): 9,923 (33.13%)

Test Set:

  • Human (0): 4,985 (66.58%)
  • Bot (1): 2,502 (33.42%)
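
These proportions can be checked directly from the label columns; a one-line sanity check, assuming y_train and y_test are pandas Series of 0/1 labels:

print(y_train.value_counts(normalize=True))   # expect ~0.67 human / ~0.33 bot
print(y_test.value_counts(normalize=True))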

🔧 Features (12)

  1. has_custom_cover_image
  2. description_length
  3. favourites_count
  4. followers_count
  5. friends_count
  6. followers_to_friends_ratio
  7. has_location
  8. username_digit_count
  9. username_length
  10. statuses_count
  11. is_verified
  12. account_age_days
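
Several of these features are simple transformations of raw profile fields (lengths, ratios, digit counts, boolean flags). A sketch of how the derived values could be computed; the raw field names used here are illustrative assumptions, not the project's actual schema:

def derive_features(profile: dict) -> dict:
    """Map a raw profile record to the 12 model features (raw field names are assumptions)."""
    username = profile.get('screen_name', '')
    followers = profile.get('followers_count', 0)
    friends = profile.get('friends_count', 0)
    return {
        'has_custom_cover_image': int(bool(profile.get('profile_banner_url'))),
        'description_length': len(profile.get('description') or ''),
        'favourites_count': profile.get('favourites_count', 0),
        'followers_count': followers,
        'friends_count': friends,
        'followers_to_friends_ratio': followers / friends if friends else 0.0,
        'has_location': int(bool(profile.get('location'))),
        'username_digit_count': sum(ch.isdigit() for ch in username),
        'username_length': len(username),
        'statuses_count': profile.get('statuses_count', 0),
        'is_verified': int(bool(profile.get('verified'))),
        'account_age_days': profile.get('account_age_days', 0),
    }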

πŸ† Top 5 Most Important Features

  1. followers_count - 0.1895
  2. favourites_count - 0.1813
  3. friends_count - 0.1494
  4. statuses_count - 0.1244
  5. account_age_days - 0.1010
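
The full ranking can be read back from the saved model via its feature_importances_ attribute (impurity-based by default). A short sketch, assuming feature_names is the list loaded from twitter_features_v2.json:

import joblib

model = joblib.load('twitter_bot_detection_v2.pkl')

# Pair each feature with its importance and sort descending
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:30s} {score:.4f}")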

βš™οΈ Hyperparameters

Best Parameters (from GridSearchCV)

  • class_weight: balanced
  • max_depth: 20
  • max_features: sqrt
  • min_samples_leaf: 1
  • min_samples_split: 2
  • n_estimators: 300

Parameter Search Space

  • n_estimators: [100, 200, 300]
  • max_depth: [10, 15, 20, None]
  • min_samples_split: [2, 5, 10]
  • min_samples_leaf: [1, 2, 4]
  • max_features: ['sqrt', 'log2']
  • bootstrap: [True, False]

Total combinations tested: 540
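
A minimal sketch of the tuning setup implied by the search space above. The scoring metric and the handling of class_weight (which appears only in the best parameters) are assumptions, so this reproduces the spirit of the search rather than the exact run:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'class_weight': ['balanced'],   # assumption: fixed rather than part of the original grid
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',              # assumption: ROC-AUC as the selection metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
grid.fit(X_train_scaled, y_train)   # assumes the scaled training data from the preprocessing notes
print(grid.best_params_)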


📈 Cross-Validation Results

Mean Scores (5-Fold Stratified CV)

  • Accuracy: 0.8750 (±0.0053)
  • Precision: 0.8658 (±0.0089)
  • Recall: 0.7368 (±0.0113)
  • F1-Score: 0.7961 (±0.0092)
  • ROC-AUC: 0.9325 (±0.0037)
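
A sketch of how these per-fold scores could be gathered with cross_validate, assuming the tuned estimator (grid.best_estimator_) and the scaled training set from the sketches above:

from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    grid.best_estimator_,
    X_train_scaled, y_train,
    cv=cv,
    scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
)
for metric in ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']:
    fold_scores = scores[f'test_{metric}']
    print(f"{metric:10s} {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")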

πŸ–ΌοΈ Visualizations

All visualizations are saved in the images/ directory:

  1. 01_class_distribution.png - Training/Test set class distribution
  2. 02_feature_correlation.png - Feature correlation with target variable
  3. 03_correlation_matrix.png - Feature correlation heatmap
  4. 04_baseline_confusion_matrix.png - Baseline model confusion matrix
  5. 05_baseline_roc_curve.png - Baseline ROC curve
  6. 06_baseline_precision_recall.png - Baseline Precision-Recall curve
  7. 07_baseline_feature_importance.png - Baseline feature importance
  8. 08_cross_validation.png - Cross-validation score distribution
  9. 09_tuned_confusion_matrix.png - Tuned model confusion matrix
  10. 10_tuned_roc_curve.png - Tuned ROC curve
  11. 11_tuned_precision_recall.png - Tuned Precision-Recall curve
  12. 12_tuned_feature_importance.png - Tuned feature importance
  13. 13_model_comparison.png - Baseline vs Tuned comparison

🚀 Usage Example

import joblib
import pandas as pd

# Load model and scaler
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Prepare your data (illustrative raw, unscaled values for a single account)
data = {
    'has_custom_cover_image': 1,
    'description_length': 42,
    'favourites_count': 1200,
    'followers_count': 350,
    'friends_count': 400,
    'followers_to_friends_ratio': 350 / 400,
    'has_location': 1,
    'username_digit_count': 0,
    'username_length': 9,
    'statuses_count': 5400,
    'is_verified': 0,
    'account_age_days': 1500,
}

# Create a single-row DataFrame (column order must match twitter_features_v2.json)
df = pd.DataFrame([data])

# Scale features with the saved MinMaxScaler
df_scaled = scaler.transform(df)

# Predict
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")

📋 Confusion Matrix Breakdown

Tuned Model (Test Set)

                 Predicted
               Human     Bot
Actual Human    4676      309
       Bot       611     1891

  • True Negatives (TN): 4,676 (correctly identified humans)
  • False Positives (FP): 309 (humans incorrectly classified as bots)
  • False Negatives (FN): 611 (bots incorrectly classified as humans)
  • True Positives (TP): 1,891 (correctly identified bots)
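
The headline metrics follow directly from these four counts; a short arithmetic check using the values above:

tn, fp, fn, tp = 4676, 309, 611, 1891

accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # 6567 / 7487 ≈ 0.8771
precision = tp / (tp + fp)                                   # 1891 / 2200 ≈ 0.8595
recall    = tp / (tp + fn)                                   # 1891 / 2502 ≈ 0.7558
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.8043

print(accuracy, precision, recall, f1)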

πŸ” Model Interpretation

Strengths

  • High ROC-AUC (0.9354) indicates strong discrimination between bot and human accounts
  • Good precision on the bot class (0.86), with recall of 0.76
  • Stable cross-validation performance (ROC-AUC 0.9325 ±0.0037 across folds)

Key Insights

  1. Activity and audience-size features (followers, favourites, friends, statuses, account age) dominate the importance ranking
  2. GridSearchCV tuning improved ROC-AUC over the baseline by 0.0040 (a 0.43% relative gain)
  3. Test-set metrics closely track the cross-validation means, indicating the model generalizes to unseen data

πŸ“ Notes

  • Feature Scaling: All features are scaled to the [0, 1] range with MinMaxScaler
  • Missing Values: Filled with 0 during preprocessing
  • Class Balance: Imbalanced (roughly 2:1 human to bot); the tuned model uses class_weight='balanced'
  • Model Type: Random Forest ensemble, which tends to be robust to overfitting
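
A minimal sketch of the preprocessing described in these notes (zero-fill, then MinMax scaling), assuming X_train and X_test are DataFrames with the 12 feature columns and that the scaler was fit on the training split only:

from sklearn.preprocessing import MinMaxScaler

# Fill missing values with 0, as described above
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

# Fit the scaler on the training split, then apply it to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)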

🔄 Model Updates

To retrain the model:

  1. Place new training data in ../data/train_twitter.csv
  2. Run the training notebook: 5_enhanced_training.ipynb
  3. Update this README with new metrics

📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.


Generated: 2025-11-27 12:08:54
Notebook: 5_enhanced_training.ipynb
Platform: Twitter
