	---
	license: mit
	language:
	- en
	pipeline_tag: text-classification
	library_name: scikit-learn
	tags:
	- password-strength
	- cybersecurity
	- random-forest
	- scikit-learn
	- password-classification
	- password-security
	- sklearn
	---
	# PasswordHealthModel

	- Model Type: Random Forest Classifier
	- Framework: scikit-learn
	- Task: Password Strength Classification (Weak / Medium / Strong)

	## Overview

	PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:

	- Weak (0)
	- Medium (1)
	- Strong (2)

	The model is a Random Forest classifier trained on 300,000 labeled passwords (100,000 per class). It is designed for integration into password management systems, where it can provide real-time strength evaluation and guidance.

	## Intended Uses

	- Integration into password managers (e.g., [Password Utility](https://github.com/naail-khokhar/password_utility)) for evaluating password health.
	- Providing real-time feedback on password strength and generating recommendations for stronger passwords.
	- Enforcing password strength policies in security-focused applications.
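
As a rough illustration of the feedback and policy use cases above, the sketch below maps the three class labels (0, 1, 2) to a small enum and checks a prediction against a minimum required level. The names `Strength`, `meets_policy`, and `feedback`, and the message wording, are hypothetical and not part of the released model.

```python
from enum import IntEnum


class Strength(IntEnum):
    """Class labels as listed in this model card."""
    WEAK = 0
    MEDIUM = 1
    STRONG = 2


def meets_policy(predicted: int, minimum: Strength = Strength.MEDIUM) -> bool:
    """Return True when the predicted class is at least the required level."""
    return Strength(predicted) >= minimum


def feedback(predicted: int) -> str:
    """Map a predicted class to a short message (wording is illustrative only)."""
    messages = {
        Strength.WEAK: "Weak: choose a longer password with mixed character types.",
        Strength.MEDIUM: "Medium: consider adding length or symbols.",
        Strength.STRONG: "Strong: no changes suggested.",
    }
    return messages[Strength(predicted)]
```

In an application, `predicted` would typically be something like `int(model.predict(features)[0])`, where `model` is the loaded classifier and `features` is the 10-value feature row for one password.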

	## Training Data

	- Weak: 100,000 passwords sourced from the [SecLists dataset](https://github.com/danielmiessler/SecLists).
	- Medium: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
	- Strong: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).

	All passwords were stripped of whitespace prior to feature extraction.
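
The synthetic classes described above could be generated along the following lines. This is a minimal sketch, not the original generation script: the symbol set (`string.punctuation`), the way the 20% symbol share is applied (one position replaced), the guarantee of at least one symbol in strong passwords, and the use of `str.strip()` for whitespace removal are all assumptions.

```python
import random
import string

ALNUM = string.ascii_letters + string.digits
SYMBOLS = string.punctuation  # assumption: the card does not define "symbols"


def make_medium(rng: random.Random) -> str:
    """8-12 alphanumeric characters; roughly 20% of samples receive one symbol."""
    length = rng.randint(8, 12)
    chars = [rng.choice(ALNUM) for _ in range(length)]
    if rng.random() < 0.20:
        # Assumption: "with symbols" is modeled by replacing a single position.
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def make_strong(rng: random.Random) -> str:
    """12-16 characters drawn from letters, digits, and symbols."""
    length = rng.randint(12, 16)
    chars = [rng.choice(ALNUM + SYMBOLS) for _ in range(length)]
    # Assumption: force at least one symbol so every sample matches the class description.
    if not any(c in SYMBOLS for c in chars):
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def strip_whitespace(password: str) -> str:
    """Remove leading and trailing whitespace (assumed meaning of "stripped")."""
    return password.strip()
```

The `random` module is adequate for building a training set but is not suitable for generating real passwords; the standard-library `secrets` module is the usual choice for that.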

	## Features (10 Total)

	- length: Number of characters.
	- entropy: Shannon entropy of characters.
	- has_upper: Binary flag indicating presence of uppercase characters.
	- has_symbol: Binary flag indicating presence of special characters.
	- has_leet: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
	- repetition: Binary flag set when the same character appears three or more times in a row.
	- digit_ratio: Ratio of digits to total length.
	- unique_ratio: Ratio of unique characters to total length.
	- bigram_entropy: Entropy of character pairs (bigrams).
	- compression_ratio: Ratio of compressed length to original length using zlib compression.
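
The ten features listed above can be computed with the standard library alone. The sketch below is one plausible reading of the list, not the project's actual extraction code: the leet character set (only the four examples given), the symbol set (`string.punctuation`), entropy measured in bits, and the interpretation of `repetition` as a run of three identical characters are assumptions.

```python
import math
import string
import zlib
from collections import Counter

LEET_CHARS = set("@3!0")                 # assumption: only the card's examples
SYMBOL_CHARS = set(string.punctuation)   # assumption: ASCII punctuation


def shannon_entropy(items) -> float:
    """Shannon entropy, in bits per item, of a sequence of hashable items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(password: str) -> dict:
    """Return the 10 features for one non-empty password."""
    n = len(password)
    if n == 0:
        raise ValueError("password must be non-empty")
    bigrams = [password[i:i + 2] for i in range(n - 1)]
    raw = password.encode("utf-8")
    # Assumption: "repetition" means three identical characters in a row.
    has_run = any(
        password[i] == password[i + 1] == password[i + 2] for i in range(n - 2)
    )
    return {
        "length": n,
        "entropy": shannon_entropy(password),
        "has_upper": int(any(c.isupper() for c in password)),
        "has_symbol": int(any(c in SYMBOL_CHARS for c in password)),
        "has_leet": int(any(c in LEET_CHARS for c in password)),
        "repetition": int(has_run),
        "digit_ratio": sum(c.isdigit() for c in password) / n,
        "unique_ratio": len(set(password)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

For short inputs the zlib header and checksum dominate, so `compression_ratio` is usually above 1 for typical passwords; it mainly separates highly repetitive strings from the rest.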

	## Model Architecture

	- Algorithm: Random Forest Classifier (scikit-learn)
	- Hyperparameters:
	  - `n_estimators`: 200
	  - `max_depth`: 20
	  - `min_samples_split`: 5
	  - `random_state`: 42
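
In scikit-learn, the hyperparameters above correspond to the following constructor call. The wrapper name `build_model` is hypothetical, and any parameter not listed in this card is left at the library default, which is an assumption about the original training setup.

```python
from sklearn.ensemble import RandomForestClassifier


def build_model() -> RandomForestClassifier:
    """Create an unfitted classifier with the hyperparameters listed in this card."""
    return RandomForestClassifier(
        n_estimators=200,      # number of trees in the forest
        max_depth=20,          # maximum depth of each tree
        min_samples_split=5,   # minimum samples required to split a node
        random_state=42,       # fixed seed for reproducibility
    )
```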

	## Performance

	- Evaluation Setup: 80/20 train-test split (240,000 training samples, 60,000 test samples)
	- Accuracy: ~96.7% (±0.6% standard deviation)
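
An 80/20 evaluation of this kind can be sketched as follows. The data here is a small random stand-in, not the real password dataset, so the printed number says nothing about the reported accuracy. Stratifying the split by class and the helper name `evaluate` are assumptions, and the card does not state how the standard deviation was obtained, so this sketch does not attempt to reproduce it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(X, y, seed: int = 42) -> float:
    """Fit on 80% of the data and return accuracy on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed  # stratify: assumption
    )
    model = RandomForestClassifier(
        n_estimators=200, max_depth=20, min_samples_split=5, random_state=seed
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Toy stand-in data (NOT the real dataset): 3 classes, 10 features per sample.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(300, 10))
print(f"toy accuracy: {evaluate(X, y):.3f}")
```

With the full 300,000-sample dataset, `test_size=0.20` yields the 240,000 / 60,000 split described above.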

	## Limitations

	- Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
	- Primarily trained on English-like and synthetic passwords.
	- Potential overfitting to synthetic strong password patterns.

	## Ethical Considerations

	The weak-password data comes from publicly available breach lists and was handled carefully during training. The model does not store actual user passwords and is intended only for classification tasks.

	## Dependencies

	This project relies on the following open-source libraries and datasets:

	- [pandas](https://github.com/pandas-dev/pandas): Data manipulation and analysis (BSD-3-Clause License).
	- [scikit-learn](https://github.com/scikit-learn/scikit-learn): Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
	- [joblib](https://github.com/joblib/joblib): Model persistence and parallel computation (BSD-3-Clause License).
	- [SecLists](https://github.com/danielmiessler/SecLists): Dataset for weak passwords (MIT License).

	If redistributing this project, please include the respective license texts for these dependencies.

	## Citation

	Khokhar, Naa'il Ahmad. (2025). PasswordHealthModel: A Random Forest Model for Password Strength Classification. Hugging Face Model Hub.