---
title: Expert Models for Wordlist-Based DGA Detection
emoji: 🛑️
colorFrom: red
colorTo: orange
sdk: static
pinned: false
license: mit
tags:
- dga-detection
- cybersecurity
- domain-generation-algorithm
- wordlist-dga
- malware-detection
- network-security
- bert
- transformers
language:
- en
---

# 🛑️ Expert Models for Wordlist-Based DGA Detection

[![Paper](https://img.shields.io/badge/Paper-Under%20Review-orange)](https://huggingface.co/Reynier/moe-wordlist-dga-models)
[![Models](https://img.shields.io/badge/Models-7%20Available-blue)](https://huggingface.co/Reynier/moe-wordlist-dga-models/tree/main/models)
[![Dataset](https://img.shields.io/badge/Dataset-160K%20Samples-green)](https://huggingface.co/Reynier/moe-wordlist-dga-models/tree/main/datasets)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)

> **Systematic evaluation of seven expert models for detecting wordlist-based Domain Generation Algorithms (DGAs), identifying ModernBERT as the optimal expert with an 86.7% F1-score on known families and 80.9% on unseen variants.**

---

## πŸ“‹ Overview

This repository contains the complete implementation of our expert-model evaluation for wordlist-based DGA detection, as described in our research paper (currently under review). Wordlist-based DGAs generate linguistically coherent domains that evade traditional detection methods, making them particularly challenging for cybersecurity systems.

### 🎯 Key Findings

| Model | Known F1 | Unknown F1 | Inference Time | Throughput (domains/s) |
|-------|----------|------------|----------------|------------------------|
| **ModernBERT** ⭐ | **86.7%** | **80.9%** | **26 ms** | **38** |
| Gemma 3 4B LoRA | 82.1% | 75.3% | 650 ms | 1.5 |
| LLaMA 3.2 3B LoRA | 81.4% | 74.8% | 680 ms | 1.4 |
| DomBertUrl | 81.2% | 84.6% | 28 ms | 35 |
| CNN | 78.9% | 72.1% | 15 ms | 66 |
| FANCI (RF) | 77.3% | 68.5% | <1 ms | >100 |
| LABin | 75.6% | 70.2% | 18 ms | 55 |

**Key Improvement:** Specialist training yields a **9.4% relative F1-score improvement** over the generalist approach on known families and a **30.2% relative improvement** on unseen families.

---

## πŸš€ Quick Start

### Option 1: Use Pre-trained Model (Recommended)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the optimal expert model from the Hub
model_name = "Reynier/moe-wordlist-dga-models"

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    subfolder="models/modernbert-wordlist-expert"
)

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {probs[0][0]:.4f}")
print(f"DGA: {probs[0][1]:.4f}")
```

### Option 2: Clone and Explore

```bash
# Clone the repository
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models

# Explore available models
ls models/

# Preview the training dataset
python3 << EOF
import csv

with open('datasets/train_wl.csv', 'r') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i < 5:
            print(row)
EOF
```
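The Option 1 snippet scores a single domain; to check several domains in one pass, a minimal batched-inference sketch is shown below. The example domains and the CUDA/CPU fallback are illustrative assumptions rather than code from the paper, and the label order (index 0 = benign, index 1 = DGA) follows the snippet above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same checkpoint as in Option 1
repo = "Reynier/moe-wordlist-dga-models"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    repo, subfolder="models/modernbert-wordlist-expert"
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Illustrative inputs only, not taken from the paper's test sets
domains = [
    "secure-banking-portal.com",
    "wordlistlike-example-domain.net",
    "updates.example.org",
]

inputs = tokenizer(domains, return_tensors="pt", padding=True,
                   truncation=True, max_length=128).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumes the same label order as Option 1: index 0 = benign, index 1 = DGA
for domain, p in zip(domains, probs):
    print(f"{domain}: benign={p[0]:.4f}  dga={p[1]:.4f}")
```

Throughput in the Key Findings table is reported in domains per second; the numbers you observe will depend on batch size and hardware.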
---

## πŸ“¦ Repository Contents

### πŸ€– Models (7 Expert Candidates)

All models are located in the `models/` directory (seven expert candidates plus the `modernbert-generalist-54f` baseline):

1. **modernbert-wordlist-expert/** ⭐ (OPTIMAL)
   - Base: `answerdotai/ModernBERT-base`
   - F1-score: 86.7% (known), 80.9% (unknown)
   - Inference: 26 ms on Tesla T4
   - Size: 575 MB

2. **modernbert-generalist-54f/** (Baseline)
   - Trained on 54 diverse DGA families
   - F1-score: 79.2% (known), 62.1% (unknown)
   - Demonstrates the specialist advantage

3. **gemma-3-4b-lora/**
   - LoRA adapters for `google/gemma-3-4b-it`
   - Exceptional precision (95.4%), lower recall (66.5%)
   - Size: 95 MB (adapters only)

4. **llama-3.2-3b-lora/**
   - LoRA adapters for `meta-llama/Llama-3.2-3B`
   - Balanced performance, slow inference
   - Size: 110 MB (adapters only)

5. **dombert-url/**
   - Domain-specialized BERT variant
   - Strong generalization (84.6% on unknown)
   - Size: 1.4 MB (LoRA adapters)

6. **cnn-wordlist/**
   - Convolutional neural network
   - Fastest inference (15 ms), moderate accuracy
   - Size: 76 KB

7. **fanci/**
   - Random Forest with engineered features
   - Traditional ML baseline
   - Size: 794 MB (includes dictionaries)

8. **labin/**
   - Hybrid linguistic-attention model
   - Keras implementation
   - Size: 8.1 MB

### πŸ“Š Datasets

All datasets are in the `datasets/` directory; a minimal loading and scoring sketch follows this section.

#### Training Datasets

1. **train_wl.csv** (160,000 samples)
   - **Purpose:** Train expert models (specialist approach)
   - **DGA families (8):** charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox (10K each)
   - **Benign:** 80,000 domains from Tranco top sites
   - **Distribution:** Balanced (50% DGA, 50% benign)
   - **Format:** `domain,family,label`

2. **train_1M.csv** (1,080,000 samples)
   - **Purpose:** Train the generalist model (baseline comparison)
   - **DGA families:** 54 diverse families (wordlist-based + algorithmic)
   - **Distribution:** Diverse multi-family dataset

#### Test Datasets

3. **test-known/** (8 families)
   - **Purpose:** Evaluate performance on training families
   - **Total samples:** 723,847
   - **Families:**
     - charbot: 11,001 samples
     - deception: 30,001 samples
     - gozi: 50,212 samples
     - manuelita: 20,001 samples
     - matsnu: 116,480 samples
     - nymaim: 217,773 samples
     - rovnix: 120,351 samples
     - suppobox: 158,028 samples
   - **Format:** Compressed `.gz` files (one per family)

4. **test-generalization/** (3 families)
   - **Purpose:** Test generalization to unseen wordlist-based DGAs
   - **Total samples:** 13,562
   - **Families:**
     - bigviktor: 2,001 samples
     - ngioweb: 2,001 samples
     - pizd: 9,560 samples
   - **Format:** Compressed `.gz` files
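For a quick sanity check against the files above, the sketch below reads `datasets/train_wl.csv` and scores a random sample with the expert model. It assumes the CSV has a `domain,family,label` header row (as the `csv.DictReader` usage in the Quick Start suggests) and integer labels with 1 = DGA and 0 = benign; verify both against the actual file. It also requires `scikit-learn` for the F1 computation.

```python
import csv
import random
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "Reynier/moe-wordlist-dga-models"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    repo, subfolder="models/modernbert-wordlist-expert"
).eval()

# Read the balanced training split; assumes a header row with
# domain,family,label columns and integer labels (0 = benign, 1 = DGA).
with open("datasets/train_wl.csv", newline="") as f:
    rows = list(csv.DictReader(f))

sample = random.sample(rows, 200)          # small random slice for a sanity check
domains = [r["domain"] for r in sample]
labels = [int(r["label"]) for r in sample]

preds = []
with torch.no_grad():
    for start in range(0, len(domains), 32):
        batch = tokenizer(domains[start:start + 32], return_tensors="pt",
                          padding=True, truncation=True, max_length=128)
        preds.extend(model(**batch).logits.argmax(dim=-1).tolist())

print(f"F1 on this sample: {f1_score(labels, preds):.3f}")
```

For the metrics reported in the paper, evaluate on the full `test-known/` and `test-generalization/` sets rather than this training-split sample.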
### πŸ““ Notebooks

Training and evaluation notebooks in `notebooks/`:

- `ModernBERT_base_DGA_Word.ipynb` - Train ModernBERT expert (8 families)
- `ModernBERT_base_DGA_54F.ipynb` - Train ModernBERT generalist (54 families)
- `Train_Gemma3_4B_DGA_WordList.ipynb` - Fine-tune Gemma with LoRA
- `Train_llama3B_DGA_WordList.ipynb` - Fine-tune LLaMA with LoRA
- `Test_Gemma3_4B_DGA_Last.ipynb` - Evaluate Gemma model
- `Test__llama3B_DGA.ipynb` - Evaluate LLaMA model
- `DomUrlBert.ipynb` - Train DomBertUrl model
- `CNN_Patron_WL.ipynb` - Train CNN model
- `FANCI.ipynb` - Train FANCI Random Forest
- `Labin_wl.ipynb` - Train LABin model

---

## πŸ”¬ Research Methodology

### Two-Phase Evaluation Protocol

1. **Phase 1: Known Families Performance**
   - Evaluate on 8 training families
   - 30 test batches × 100 domains per family
   - Measures detection accuracy on familiar variants

2. **Phase 2: Generalization Capability**
   - Evaluate on 3 unseen wordlist-based families
   - Tests robustness against novel DGA variants
   - Critical for real-world deployment

### Evaluation Metrics

- **Precision:** Accuracy of DGA predictions
- **Recall:** Coverage of actual DGAs
- **F1-Score:** Harmonic mean (primary metric)
- **False Positive Rate (FPR):** Benign misclassification rate
- **Inference Time:** Real-world performance (Tesla T4 GPU)

### DGA Families

**Training Families (8 wordlist-based):**
- charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox

**Generalization Test (3 wordlist-based):**
- bigviktor, ngioweb, pizd

---

## πŸ› ️ Installation & Requirements

### Basic Requirements

```bash
pip install torch transformers safetensors
```

### For LLM Models (Gemma/LLaMA)

```bash
pip install peft accelerate bitsandbytes
```

### For Traditional Models (FANCI)

```bash
pip install scikit-learn joblib
```

### For LABin Model

```bash
pip install tensorflow keras
```

### GPU Recommendations

- **Optimal:** NVIDIA Tesla T4 or better
- **Minimum:** 8 GB VRAM for ModernBERT
- **LLMs:** 16 GB+ VRAM (or use 8-bit quantization)

---

## πŸ“Š Reproducibility

All experiments are fully reproducible:

1. **Download datasets** from the `datasets/` folder
2. **Run training notebooks** from `notebooks/`
3. **Load pre-trained models** from `models/`
4. **Verify reported metrics** using the test sets

### Expected Results (± std)

| Model | Known F1 | Unknown F1 |
|-------|----------|------------|
| ModernBERT | 86.7% ± 3.0% | 80.9% ± 4.5% |
| Generalist | 79.2% ± 3.5% | 62.1% ± 5.2% |

---

## πŸ“– Citation

This work is currently under review. Preliminary citation:

```bibtex
@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection: A Systematic Evaluation},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}
```

---

## πŸ”— Links

- **GitHub Repository:** [MoE-word-list-dga-detection](https://github.com/reypapin/MoE-word-list-dga-detection)
- **Paper:** Under Review
- **Contact:** rleyvalao@mendoza-conicet.gob.ar

---

## πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

---

## πŸ™ Acknowledgments

- **Datasets:** DGArchive, 360 Netlab, UMUDga, Tranco
- **Base Models:** ModernBERT (Answer.AI), Gemma (Google), LLaMA (Meta)
- **Infrastructure:** CONICET Argentina

---

## πŸ” Quick Navigation

- [Models Directory](./models/)
- [Datasets Directory](./datasets/)
- [Training Notebooks](./notebooks/)
- [Results & Metrics](https://github.com/reypapin/MoE-word-list-dga-detection/tree/main/Result_csv)

---

**Last Updated:** October 2025