Twitter Bot Detection Model
Overview
This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.
Model Version: v2
Training Date: 2025-11-27 12:08:54
Framework: scikit-learn 1.5.2
Algorithm: Random Forest Classifier with GridSearchCV Hyperparameter Tuning
Model Performance
Final Metrics (Test Set)
| Metric | Score |
| --- | --- |
| Accuracy | 0.8771 (87.71%) |
| Precision | 0.8595 (85.95%) |
| Recall | 0.7558 (75.58%) |
| F1-Score | 0.8043 (80.43%) |
| ROC-AUC | 0.9354 (93.54%) |
| Average Precision | 0.9008 (90.08%) |
Model Improvement
- Baseline ROC-AUC: 0.9314
- Tuned ROC-AUC: 0.9354
- Improvement: +0.0040 absolute (0.43% relative to baseline)
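The improvement figure can be reproduced from the two ROC-AUC values. The short sketch below assumes the percentage is the gain relative to the baseline score, which matches the reported 0.43%.

```python
# ROC-AUC values reported in this README
baseline_auc = 0.9314
tuned_auc = 0.9354

# Absolute gain in ROC-AUC points, and gain relative to the baseline
absolute_gain = tuned_auc - baseline_auc
relative_gain = absolute_gain / baseline_auc

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain:.2%}")
```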
Files
| File | Description |
| --- | --- |
| twitter_bot_detection_v2.pkl | Trained Random Forest model |
| twitter_scaler_v2.pkl | MinMaxScaler for feature normalization |
| twitter_features_v2.json | List of features used by the model |
| twitter_metrics_v2.txt | Detailed performance metrics report |
| images/ | All visualization plots (13 images) |
| README.md | This file |
Dataset Information
Training Configuration
- Training Samples: 29,951
- Test Samples: 7,487
- Total Samples: 37,438
- Number of Features: 12
- Cross-Validation Folds: 5
- Random State: 42
Class Distribution
Training Set:
- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)
Test Set:
- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)
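As a quick consistency check, the class shares and the train/test split ratio can be recomputed from the raw counts above. This is only arithmetic on the numbers reported in this README; the variable names are illustrative.

```python
# Class counts reported in this README
counts = {
    "train": {"human": 20028, "bot": 9923},
    "test": {"human": 4985, "bot": 2502},
}

def class_shares(split_counts):
    """Return each class's share of the split as a percentage."""
    total = sum(split_counts.values())
    return {label: 100 * n / total for label, n in split_counts.items()}

shares = {split: class_shares(c) for split, c in counts.items()}
totals = {split: sum(c.values()) for split, c in counts.items()}

# Fraction of all samples held out for testing
test_fraction = totals["test"] / (totals["train"] + totals["test"])

for split in counts:
    print(split, totals[split], {k: round(v, 2) for k, v in shares[split].items()})
print(f"Test fraction: {test_fraction:.4f}")
```

Both splits keep close to the same human/bot ratio, which is what a stratified split would produce.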
Features (12)
- has_custom_cover_image
- description_length
- favourites_count
- followers_count
- friends_count
- followers_to_friends_ratio
- has_location
- username_digit_count
- username_length
- statuses_count
- is_verified
- account_age_days
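This README does not define how each feature is derived, so the sketch below is a guess based on the feature names alone. The raw profile keys (`screen_name`, `profile_banner_url`, `verified`, `created_at`, and so on), the handling of zero friends, and the function name `extract_features` are all assumptions; the training notebook remains the authority on the real definitions.

```python
from datetime import datetime, timezone

def extract_features(profile, now):
    """Map a raw profile dict to the 12 model features (definitions are assumed)."""
    username = profile.get("screen_name") or ""
    description = profile.get("description") or ""
    followers = profile.get("followers_count") or 0
    friends = profile.get("friends_count") or 0
    created_at = profile["created_at"]  # assumed to be a timezone-aware datetime
    return {
        "has_custom_cover_image": int(bool(profile.get("profile_banner_url"))),
        "description_length": len(description),
        "favourites_count": profile.get("favourites_count") or 0,
        "followers_count": followers,
        "friends_count": friends,
        # Assumed definition; accounts with zero friends are mapped to 0.0 here
        "followers_to_friends_ratio": followers / friends if friends else 0.0,
        "has_location": int(bool(profile.get("location"))),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "username_length": len(username),
        "statuses_count": profile.get("statuses_count") or 0,
        "is_verified": int(bool(profile.get("verified"))),
        "account_age_days": (now - created_at).days,
    }

# Hypothetical profile, for illustration only
example_profile = {
    "screen_name": "user12345",
    "description": "Coffee and code.",
    "location": "",
    "verified": False,
    "followers_count": 150,
    "friends_count": 300,
    "favourites_count": 250,
    "statuses_count": 1200,
    "profile_banner_url": None,
    "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc),
}
features = extract_features(example_profile, now=datetime(2025, 11, 27, tzinfo=timezone.utc))
print(features)
```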
Top 5 Most Important Features
- followers_count - 0.1895
- favourites_count - 0.1813
- friends_count - 0.1494
- statuses_count - 0.1244
- account_age_days - 0.1010
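Because scikit-learn normalizes a Random Forest's impurity-based importances to sum to 1, the five values above can be read as shares of the total. The sketch below accumulates them to show how much of the importance the top five features carry together; the variable names are illustrative.

```python
# Importances reported in this README for the five top-ranked features
top_importances = {
    "followers_count": 0.1895,
    "favourites_count": 0.1813,
    "friends_count": 0.1494,
    "statuses_count": 0.1244,
    "account_age_days": 0.1010,
}

# Walk the features from most to least important, tracking the running total
cumulative = 0.0
for name, importance in sorted(top_importances.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += importance
    print(f"{name:<20} {importance:.4f}  cumulative {cumulative:.4f}")

# Share left for the remaining 7 features
remaining = 1.0 - cumulative
print(f"Remaining 7 features: {remaining:.4f}")
```

With the loaded model, the full ranking could be obtained in the same way by pairing the feature names with `model.feature_importances_`.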
Hyperparameters
Best Parameters (from GridSearchCV)
- class_weight: balanced
- max_depth: 20
- max_features: sqrt
- min_samples_leaf: 1
- min_samples_split: 2
- n_estimators: 300
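With `class_weight: balanced`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training class counts reported in this README, to show roughly how much more a bot sample counts than a human sample. The helper name `balanced_class_weights` is illustrative, and the exact weights during cross-validation would differ slightly because each fold sees a subset of the data.

```python
def balanced_class_weights(class_counts):
    """Weights as computed for class_weight='balanced': n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Training set counts: 0 = human, 1 = bot
weights = balanced_class_weights({0: 20028, 1: 9923})

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
print(f"bot/human weight ratio: {weights[1] / weights[0]:.2f}")
```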
Parameter Search Space
- n_estimators: [100, 200, 300]
- max_depth: [10, 15, 20, None]
- min_samples_split: [2, 5, 10]
- min_samples_leaf: [1, 2, 4]
- max_features: ['sqrt', 'log2']
- bootstrap: [True, False]
Total combinations tested: 540
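The size of a grid search is the product of the number of options per parameter. The sketch below enumerates the search space exactly as listed above. Note that this enumeration gives a different count from the reported total, so the grid actually searched may have differed from the one listed (for example, `class_weight` appears among the best parameters but not in the listed search space).

```python
from itertools import product

# Search space exactly as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Every combination is one candidate model
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

# Each candidate is fitted once per cross-validation fold
cv_folds = 5
n_fits = n_combinations * cv_folds

print(f"Combinations in listed grid: {n_combinations}")
print(f"Model fits with {cv_folds}-fold CV: {n_fits}")
```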
Cross-Validation Results
Mean Scores (5-Fold Stratified CV)
- Accuracy: 0.8750 (±0.0053)
- Precision: 0.8658 (±0.0089)
- Recall: 0.7368 (±0.0113)
- F1-Score: 0.7961 (±0.0092)
- ROC-AUC: 0.9325 (±0.0037)
Visualizations
All visualizations are saved in the images/ directory:
- 01_class_distribution.png - Training/Test set class distribution
- 02_feature_correlation.png - Feature correlation with target variable
- 03_correlation_matrix.png - Feature correlation heatmap
- 04_baseline_confusion_matrix.png - Baseline model confusion matrix
- 05_baseline_roc_curve.png - Baseline ROC curve
- 06_baseline_precision_recall.png - Baseline Precision-Recall curve
- 07_baseline_feature_importance.png - Baseline feature importance
- 08_cross_validation.png - Cross-validation score distribution
- 09_tuned_confusion_matrix.png - Tuned model confusion matrix
- 10_tuned_roc_curve.png - Tuned ROC curve
- 11_tuned_precision_recall.png - Tuned Precision-Recall curve
- 12_tuned_feature_importance.png - Tuned feature importance
- 13_model_comparison.png - Baseline vs Tuned comparison
Usage Example
```python
import joblib
import pandas as pd

# Load the trained classifier and the scaler fitted on the training data
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Raw (unscaled) feature values for one account; the numbers are illustrative.
# Keys must appear in the same order as the features used in training.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 250,
    'followers_count': 150,
    'friends_count': 300,
    'followers_to_friends_ratio': 0.5,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 12,
    'statuses_count': 1200,
    'is_verified': 0,
    'account_age_days': 900,
}
df = pd.DataFrame([data])

# Scale with the fitted MinMaxScaler, then predict
df_scaled = scaler.transform(df)
prediction = model.predict(df_scaled)[0]
# Probabilities follow model.classes_: index 0 is human, index 1 is bot
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```
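Since the scaler and model depend on column order, it can help to validate an input record against the feature list before building the DataFrame. The sketch below assumes `twitter_features_v2.json` holds a plain JSON array of feature names; that structure is not confirmed here, so an inline string stands in for the file contents. The helper name `order_features` is illustrative.

```python
import json

# Stand-in for the contents of twitter_features_v2.json (assumed to be a JSON array)
features_json = """[
  "has_custom_cover_image", "description_length", "favourites_count",
  "followers_count", "friends_count", "followers_to_friends_ratio",
  "has_location", "username_digit_count", "username_length",
  "statuses_count", "is_verified", "account_age_days"
]"""
feature_names = json.loads(features_json)

def order_features(record, feature_names):
    """Return the record's values in training order, or raise if the keys do not match."""
    missing = [name for name in feature_names if name not in record]
    unexpected = [name for name in record if name not in feature_names]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [record[name] for name in feature_names]

# A record whose keys arrive in a different order than the model expects
record = {name: 0 for name in reversed(feature_names)}
record["followers_count"] = 150

row = order_features(record, feature_names)
print(dict(zip(feature_names, row)))
```

The resulting `row` could then be passed on as `pd.DataFrame([row], columns=feature_names)` before scaling.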
Confusion Matrix Breakdown
Tuned Model (Test Set)
| | Predicted Human | Predicted Bot |
| --- | --- | --- |
| Actual Human | 4676 | 309 |
| Actual Bot | 611 | 1891 |
- True Negatives (TN): 4,676 (Correctly identified humans)
- False Positives (FP): 309 (Humans incorrectly classified as bots)
- False Negatives (FN): 611 (Bots incorrectly classified as humans)
- True Positives (TP): 1,891 (Correctly identified bots)
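The headline test-set metrics follow directly from these four counts. The sketch below recomputes them using the standard definitions, treating bot (1) as the positive class.

```python
# Confusion matrix counts for the tuned model on the test set
tn, fp, fn, tp = 4676, 309, 611, 1891

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of accounts flagged as bots, how many are bots
recall = tp / (tp + fn)     # of actual bots, how many are caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of humans wrongly flagged

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")
```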
Model Interpretation
Strengths
- High ROC-AUC score (0.9354) indicates excellent discrimination capability
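- Low false-positive rate: only 309 of 4,985 humans (about 6.2%) are flagged as bots, although bot recall (0.7558) trails bot precision (0.8595)
- Stable cross-validation performance: the ± spread across the 5 folds is at most 0.0113 for every metric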
Key Insights
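- The five most important features (followers_count, favourites_count, friends_count, statuses_count, account_age_days) together account for about 75% of total feature importance
- GridSearchCV tuning raised ROC-AUC from 0.9314 to 0.9354, a modest gain of 0.0040 (0.43% relative)
- Test-set metrics track the cross-validation means closely (for example, ROC-AUC 0.9354 vs. 0.9325), which suggests the model generalizes to held-out data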
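One way to check the generalization claim is to compare each test-set metric with its cross-validation mean, expressed in units of the reported spread. The sketch below assumes the ± values in the cross-validation results are standard deviations across folds, which this README does not state explicitly.

```python
# Cross-validation results reported in this README: (mean, spread)
cv_scores = {
    "Accuracy": (0.8750, 0.0053),
    "Precision": (0.8658, 0.0089),
    "Recall": (0.7368, 0.0113),
    "F1-Score": (0.7961, 0.0092),
    "ROC-AUC": (0.9325, 0.0037),
}

# Test-set metrics for the tuned model
test_scores = {
    "Accuracy": 0.8771,
    "Precision": 0.8595,
    "Recall": 0.7558,
    "F1-Score": 0.8043,
    "ROC-AUC": 0.9354,
}

# Distance of each test score from the CV mean, in units of the CV spread
deviations = {
    metric: (test_scores[metric] - mean) / spread
    for metric, (mean, spread) in cv_scores.items()
}

for metric, deviation in deviations.items():
    print(f"{metric:<10} test={test_scores[metric]:.4f}  deviation={deviation:+.2f} spreads")
```

Under that assumption, every test metric lies within two spreads of its cross-validation mean, with recall showing the largest gap.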
Notes
- Feature Scaling: All features are scaled using MinMaxScaler to [0, 1] range
- Missing Values: Filled with 0 during preprocessing
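- Class Balance: Imbalanced dataset (roughly 67% human, 33% bot); the tuned model uses class_weight: balanced to compensate
- Model Type: Random Forest, an ensemble of decision trees that is less prone to overfitting than a single tree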
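The preprocessing described in these notes can be sketched in a few lines. Min-max scaling maps a value to `(x - min) / (max - min)` using the minimum and maximum seen during training. The sketch assumes missing values are filled with 0 before scaling, and the bounds used in the example are hypothetical rather than taken from the saved scaler.

```python
def min_max_scale(value, data_min, data_max):
    """Fill a missing value with 0, then scale to [0, 1] using training-set bounds."""
    if value is None:
        value = 0  # missing values are filled with 0 (assumed to happen before scaling)
    if data_max == data_min:
        return 0.0  # constant feature: avoid division by zero
    return (value - data_min) / (data_max - data_min)

# Hypothetical training bounds for followers_count, for illustration only
followers_min, followers_max = 0, 1000

print(min_max_scale(250, followers_min, followers_max))   # inside the training range
print(min_max_scale(None, followers_min, followers_max))  # missing value
print(min_max_scale(2000, followers_min, followers_max))  # above the training maximum
```

Values outside the range seen in training scale to outside [0, 1], as the last call shows; scikit-learn's MinMaxScaler behaves the same way unless it was created with `clip=True`.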
Model Updates
To retrain the model:
- Place new training data in ../data/train_twitter.csv
- Run the training notebook: 5_enhanced_training.ipynb
- Update this README with new metrics
Contact & Support
For questions or issues regarding this model, please refer to the main project documentation.
Generated: 2025-11-27 12:08:54
Notebook: 5_enhanced_training.ipynb
Platform: Twitter