---
license: mit
language:
- en
pipeline_tag: text-classification
library_name: scikit-learn
tags:
- password-strength
- cybersecurity
- random-forest
- scikit-learn
- password-classification
- password-security
- sklearn
---

# PasswordHealthModel

**Model Type**: Random Forest Classifier

**Framework**: scikit-learn

**Task**: Password Strength Classification (Weak / Medium / Strong)

## Overview

PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:

- **Weak (0)**
- **Medium (1)**
- **Strong (2)**

The model leverages a Random Forest Classifier trained on 300,000 labeled passwords and is designed for integration into password management systems to provide real-time strength evaluation and guidance.

## Intended Uses

- Integration into password managers (e.g., [Password Utility](https://github.com/naail-khokhar/password_utility)) for evaluating password health.
- Providing real-time feedback on password strength and generating recommendations for stronger passwords.
- Enforcing password strength policies in security-focused applications.

## Training Data

- **Weak**: 100,000 passwords sourced from the [SecLists dataset](https://github.com/danielmiessler/SecLists).
- **Medium**: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
- **Strong**: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).

All passwords were stripped of whitespace prior to feature extraction.

## Features (10 Total)

- **length**: Number of characters.
- **entropy**: Shannon entropy of characters.
- **has_upper**: Binary flag indicating presence of uppercase characters.
- **has_symbol**: Binary flag indicating presence of special characters.
- **has_leet**: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
- **repetition**: Binary flag for repeated sequences (≥3 consecutive repeated characters).
- **digit_ratio**: Ratio of digits to total length.
- **unique_ratio**: Ratio of unique characters to total length.
- **bigram_entropy**: Entropy of character pairs (bigrams).
- **compression_ratio**: Ratio of compressed length to original length using zlib compression.

## Model Architecture

- **Algorithm**: Random Forest Classifier (scikit-learn)
- **Hyperparameters**:
  - `n_estimators`: 200
  - `max_depth`: 20
  - `min_samples_split`: 5
  - `random_state`: 42

## Performance

- **Evaluation Setup**: 80/20 train-test split (80% training, 20% testing; 240,000 training samples, 60,000 test samples)
- **Accuracy**: ~96.7% (±0.6% standard deviation)

## Limitations

- Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
- Primarily trained on English-like and synthetic passwords.
- Potential overfitting to synthetic strong password patterns.

## Ethical Considerations

Weak password data is sourced from publicly available breaches with careful handling. The model does not store actual user passwords and is intended only for classification tasks.

## Dependencies

My project relies on the following open-source libraries and datasets:

- **[pandas](https://github.com/pandas-dev/pandas)**: Data manipulation and analysis (BSD-3-Clause License).
- **[scikit-learn](https://github.com/scikit-learn/scikit-learn)**: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
- **[joblib](https://github.com/joblib/joblib)**: Model persistence and parallel computation (MIT License).
- **[SecLists](https://github.com/danielmiessler/SecLists)**: Dataset for weak passwords (MIT License).

If redistributing this project, please include the respective license texts for these dependencies.

## Citation

Khokhar, Naa'il Ahmad. (2025). *PasswordHealthModel: A Random Forest Model for Password Strength Classification*. Hugging Face Model Hub.
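
## Example: Feature Extraction Sketch

The ten features listed under "Features (10 Total)" can be approximated with the Python standard library alone. The sketch below is illustrative and is not the author's extraction code. The names `extract_features`, `shannon_entropy`, `FEATURE_NAMES`, and `LEET_CHARS` are hypothetical. It assumes that the leet-speak set contains only the four characters given as examples in this card, that "stripped of whitespace" means leading and trailing whitespace, that a "special character" is any non-alphanumeric character, and that entropy is measured in bits over empirical character (or bigram) frequencies.

```python
import math
import re
import zlib
from collections import Counter

# Assumption: the card gives only "@, 3, !, 0" as examples; the real set may be larger.
LEET_CHARS = set("@3!0")

# Assumption: this ordering mirrors the card's feature list, not necessarily the model's input order.
FEATURE_NAMES = [
    "length", "entropy", "has_upper", "has_symbol", "has_leet",
    "repetition", "digit_ratio", "unique_ratio", "bigram_entropy",
    "compression_ratio",
]


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over items."""
    items = list(items)
    if not items:
        return 0.0
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(password):
    """Return a dict with the ten features described in this card."""
    pw = password.strip()  # assumption: only leading/trailing whitespace is removed
    n = len(pw)
    if n == 0:
        # Avoid division by zero for empty input.
        return dict.fromkeys(FEATURE_NAMES, 0.0)

    raw = pw.encode("utf-8")
    bigrams = [pw[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "entropy": shannon_entropy(pw),
        "has_upper": int(any(c.isupper() for c in pw)),
        "has_symbol": int(any(not c.isalnum() for c in pw)),
        "has_leet": int(any(c in LEET_CHARS for c in pw)),
        # Three or more consecutive occurrences of the same character.
        "repetition": int(re.search(r"(.)\1{2,}", pw) is not None),
        "digit_ratio": sum(c.isdigit() for c in pw) / n,
        "unique_ratio": len(set(pw)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        # zlib adds header overhead, so short strings usually give a ratio above 1.
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }


if __name__ == "__main__":
    features = extract_features("P@ssw0rd!!!")
    for name in FEATURE_NAMES:
        print(f"{name}: {features[name]}")
```

A vector built from these values in `FEATURE_NAMES` order could be passed to a scikit-learn classifier's `predict` method, but the feature order and exact definitions expected by the published model are not stated in this card and would need to be confirmed against the original training code.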