+---
+tags:
+- phishing-detection
+- logistic-regression
+- tfidf
+- sklearn
+- datasets
+- huggingface
+license: mit
+---
+# Phishing Detection Model using Logistic Regression and TF-IDF
+This model is a phishing detection classifier that uses TF-IDF for feature extraction and Logistic Regression for classification. It labels each text sample as phishing or non-phishing.
+## Model Details
+- **Framework**: Scikit-learn
+- **Feature Extraction**: TF-IDF Vectorizer (top 5000 features)
+- **Algorithm**: Logistic Regression
+- **Dataset**: [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset) (combined_reduced subset)
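The two components listed above can be bundled into a single scikit-learn `Pipeline`, so that vectorization and classification are fitted and applied together. The following is a minimal sketch, not the author's implementation: the helper name `build_pipeline` and the toy samples are assumptions made for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(max_features=5000):
    """Hypothetical helper: TF-IDF (top `max_features` terms) + Logistic Regression."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=max_features)),
        ("clf", LogisticRegression()),
    ])


# Toy samples invented for illustration only; they are NOT from the dataset.
toy_text = [
    "verify your account now to avoid suspension",
    "click this link to claim your prize",
    "meeting moved to 3pm tomorrow",
    "here are the notes from today's lecture",
]
toy_labels = [1, 1, 0, 0]  # 1 = phishing, 0 = non-phishing

pipeline = build_pipeline()
pipeline.fit(toy_text, toy_labels)

# The learned vocabulary never exceeds the max_features cap.
vocab_size = len(pipeline.named_steps["tfidf"].vocabulary_)
print(f"Vocabulary size: {vocab_size}")
```

One reason to prefer a pipeline is that the vectorizer can only ever be fitted on the data passed to `fit`, which helps avoid accidentally fitting it on test text.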
+## Installation
+Before using the model, ensure you have the necessary dependencies installed:
+```bash
+pip install -U scikit-learn pandas datasets pyarrow
+```
+## How to Use
+Below is an example of how to train and evaluate the model:
+```python
+from datasets import load_dataset
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import accuracy_score
+# Load the dataset
+dataset_reduced = load_dataset("ealvaradob/phishing-dataset", "combined_reduced", trust_remote_code=True)
+# Convert to pandas DataFrame
+df = dataset_reduced['train'].to_pandas()
+# Extract text and labels
+text = df['text'].values
+labels = df['label'].values
+# Split the data into train and test sets
+train_text, test_text, train_labels, test_labels = train_test_split(
+    text, labels, test_size=0.2, random_state=42
+)
+# Fit the TF-IDF vectorizer on the training text only, then transform both splits
+vectorizer = TfidfVectorizer(max_features=5000)
+train_features = vectorizer.fit_transform(train_text)
+test_features = vectorizer.transform(test_text)
+# Create and train the logistic regression model
+model = LogisticRegression()
+model.fit(train_features, train_labels)
+# Make predictions on the test set
+predictions = model.predict(test_features)
+# Evaluate the model's accuracy
+accuracy = accuracy_score(test_labels, predictions)
+print(f'Accuracy: {accuracy}')
+```
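Once the vectorizer and model are fitted, new text can be scored by transforming it with the same vectorizer and calling `predict` or `predict_proba`. The sketch below is self-contained and therefore trains on a handful of invented messages rather than the real dataset; the sample texts and the variable names `new_messages` and `phishing_prob` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real training split (invented, NOT from the dataset).
train_text = [
    "your account has been suspended, verify your password now",
    "you won a prize, click the link to claim it",
    "urgent: confirm your bank details at this link",
    "can we move the meeting to thursday afternoon",
    "attached are the slides from today's lecture",
    "thanks for dinner last night, see you soon",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = non-phishing

vectorizer = TfidfVectorizer(max_features=5000)
train_features = vectorizer.fit_transform(train_text)

model = LogisticRegression()
model.fit(train_features, train_labels)

# Score unseen messages: reuse the fitted vectorizer (transform only, no re-fit).
new_messages = [
    "please confirm your password at this link",
    "lunch at noon on friday?",
]
new_features = vectorizer.transform(new_messages)
predicted = model.predict(new_features)

# predict_proba columns follow model.classes_, so look up the phishing column.
phishing_col = list(model.classes_).index(1)
phishing_prob = model.predict_proba(new_features)[:, phishing_col]

for message, label, prob in zip(new_messages, predicted, phishing_prob):
    print(f"label={label} p(phishing)={prob:.2f} text={message!r}")
```

With so few training samples the printed probabilities are not meaningful; the point is only the order of calls.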
+## Results
+- **Accuracy**: Measured on the held-out test set (20% of the data, `random_state=42`). The training script above prints the value as `Accuracy: ...`.
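Accuracy alone can hide how a phishing classifier handles each class, so it may help to also look at precision, recall, and the confusion matrix. This is a sketch under stated assumptions: the label and prediction arrays below are hand-written placeholders, not results from this model, and in practice they would be the `test_labels` and `predictions` arrays from the training script.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written placeholder values for illustration; NOT real model output.
test_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)  # of flagged items, how many are phishing
recall = recall_score(test_labels, predictions)        # of phishing items, how many were flagged
f1 = f1_score(test_labels, predictions)

# Rows are true labels, columns are predicted labels, in the order [0, 1].
matrix = confusion_matrix(test_labels, predictions, labels=[0, 1])

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1:.2f}")
print(matrix)
```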
+## Dataset
+The dataset used for training and evaluation is the [ealvaradob/phishing-dataset](https://huggingface.co/datasets/ealvaradob/phishing-dataset). It contains a variety of phishing and non-phishing samples labeled as `1` (phishing) and `0` (non-phishing).
+## Limitations and Future Work
+- The model uses a simple Logistic Regression algorithm, which may not capture complex patterns in text as effectively as deep learning models.
+- Future versions could incorporate transformer-based models such as BERT.
+## License
+This project is licensed under the MIT License. Feel free to use, modify, and distribute this model as per the terms of the license.
+## Acknowledgements
+- [Hugging Face Datasets](https://huggingface.co/datasets)
+- [Scikit-learn](https://scikit-learn.org/)
+---
+license: apache-2.0
+---