---
title: Expert Models for Wordlist-Based DGA Detection
emoji: 🛑️
colorFrom: red
colorTo: orange
sdk: static
pinned: false
license: mit
tags:
- dga-detection
- cybersecurity
- domain-generation-algorithm
- wordlist-dga
- malware-detection
- network-security
- bert
- transformers
language:
- en
---

# 🛑️ Expert Models for Wordlist-Based DGA Detection

[![Paper](https://img.shields.io/badge/Paper-Under%20Review-orange)](https://huggingface.co/Reynier/moe-wordlist-dga-models)
[![Models](https://img.shields.io/badge/Models-7%20Available-blue)](https://huggingface.co/Reynier/moe-wordlist-dga-models/tree/main/models)
[![Dataset](https://img.shields.io/badge/Dataset-160K%20Samples-green)](https://huggingface.co/Reynier/moe-wordlist-dga-models/tree/main/datasets)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)

> **Systematic evaluation of seven expert models for detecting wordlist-based Domain Generation Algorithms (DGAs), identifying ModernBERT as the optimal expert with an 86.7% F1-score on known families and 80.9% on unseen variants.**

---

## πŸ“‹ Overview

This repository contains the complete implementation of our expert-model evaluation for wordlist-based DGA detection, as described in our research paper (currently under review). Wordlist-based DGAs generate linguistically coherent domains that evade traditional detection methods, making them particularly challenging for cybersecurity systems.

### 🎯 Key Findings

| Model | Known F1 | Unknown F1 | Inference Time | Throughput (domains/s) |
|-------|----------|------------|----------------|------------------------|
| **ModernBERT** ⭐ | **86.7%** | **80.9%** | **26 ms** | **38** |
| Gemma 3 4B LoRA | 82.1% | 75.3% | 650 ms | 1.5 |
| LLaMA 3.2 3B LoRA | 81.4% | 74.8% | 680 ms | 1.4 |
| DomBertUrl | 81.2% | 84.6% | 28 ms | 35 |
| CNN | 78.9% | 72.1% | 15 ms | 66 |
| FANCI (RF) | 77.3% | 68.5% | <1 ms | >100 |
| LABin | 75.6% | 70.2% | 18 ms | 55 |

**Key Improvement:** Specialist training yields a **9.4% relative F1-score improvement** over the generalist approach on known families and a **30.2% relative improvement** on unseen families.

---

## πŸš€ Quick Start

### Option 1: Use Pre-trained Model (Recommended)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the optimal expert model from the Hub
model_name = "Reynier/moe-wordlist-dga-models"

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    subfolder="models/modernbert-wordlist-expert"
)

# Classify a domain
domain = "secure-banking-portal.com"
inputs = tokenizer(domain, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)

print(f"Benign: {probs[0][0]:.4f}")
print(f"DGA: {probs[0][1]:.4f}")
```

### Option 2: Clone and Explore

```bash
# Clone the repository
git clone https://huggingface.co/Reynier/moe-wordlist-dga-models
cd moe-wordlist-dga-models

# Explore available models
ls models/

# Preview the training dataset
python3 << EOF
import csv

with open('datasets/train_wl.csv', 'r') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i < 5:
            print(row)
EOF
```
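The Option 1 snippet scores a single domain; to check several domains in one pass, a minimal batched-inference sketch is shown below. The example domains and the CUDA/CPU fallback are illustrative assumptions rather than code from the paper, and the label order (index 0 = benign, index 1 = DGA) follows the snippet above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same checkpoint as in Option 1
repo = "Reynier/moe-wordlist-dga-models"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    repo, subfolder="models/modernbert-wordlist-expert"
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

# Illustrative inputs only, not taken from the paper's test sets
domains = [
    "secure-banking-portal.com",
    "wordlistlike-example-domain.net",
    "updates.example.org",
]

inputs = tokenizer(domains, return_tensors="pt", padding=True,
                   truncation=True, max_length=128).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# Assumes the same label order as Option 1: index 0 = benign, index 1 = DGA
for domain, p in zip(domains, probs):
    print(f"{domain}: benign={p[0]:.4f}  dga={p[1]:.4f}")
```

Throughput in the Key Findings table is reported in domains per second; the numbers you observe will depend on batch size and hardware.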
---

## πŸ“¦ Repository Contents

### πŸ€– Models (7 Expert Candidates)

All models are located in the `models/` directory (seven expert candidates plus the `modernbert-generalist-54f` baseline):

1. **modernbert-wordlist-expert/** ⭐ (OPTIMAL)
   - Base: `answerdotai/ModernBERT-base`
   - F1-score: 86.7% (known), 80.9% (unknown)
   - Inference: 26 ms on Tesla T4
   - Size: 575 MB

2. **modernbert-generalist-54f/** (Baseline)
   - Trained on 54 diverse DGA families
   - F1-score: 79.2% (known), 62.1% (unknown)
   - Demonstrates the specialist advantage

3. **gemma-3-4b-lora/**
   - LoRA adapters for `google/gemma-3-4b-it`
   - Exceptional precision (95.4%), lower recall (66.5%)
   - Size: 95 MB (adapters only)

4. **llama-3.2-3b-lora/**
   - LoRA adapters for `meta-llama/Llama-3.2-3B`
   - Balanced performance, slow inference
   - Size: 110 MB (adapters only)

5. **dombert-url/**
   - Domain-specialized BERT variant
   - Strong generalization (84.6% on unknown)
   - Size: 1.4 MB (LoRA adapters)

6. **cnn-wordlist/**
   - Convolutional neural network
   - Fastest inference (15 ms), moderate accuracy
   - Size: 76 KB

7. **fanci/**
   - Random Forest with engineered features
   - Traditional ML baseline
   - Size: 794 MB (includes dictionaries)

8. **labin/**
   - Hybrid linguistic-attention model
   - Keras implementation
   - Size: 8.1 MB

### πŸ“Š Datasets

All datasets are in the `datasets/` directory; a minimal loading and scoring sketch follows this section.

#### Training Datasets

1. **train_wl.csv** (160,000 samples)
   - **Purpose:** Train expert models (specialist approach)
   - **DGA families (8):** charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox (10K each)
   - **Benign:** 80,000 domains from Tranco top sites
   - **Distribution:** Balanced (50% DGA, 50% benign)
   - **Format:** `domain,family,label`

2. **train_1M.csv** (1,080,000 samples)
   - **Purpose:** Train the generalist model (baseline comparison)
   - **DGA families:** 54 diverse families (wordlist-based + algorithmic)
   - **Distribution:** Diverse multi-family dataset

#### Test Datasets

3. **test-known/** (8 families)
   - **Purpose:** Evaluate performance on training families
   - **Total samples:** 723,847
   - **Families:**
     - charbot: 11,001 samples
     - deception: 30,001 samples
     - gozi: 50,212 samples
     - manuelita: 20,001 samples
     - matsnu: 116,480 samples
     - nymaim: 217,773 samples
     - rovnix: 120,351 samples
     - suppobox: 158,028 samples
   - **Format:** Compressed `.gz` files (one per family)

4. **test-generalization/** (3 families)
   - **Purpose:** Test generalization to unseen wordlist-based DGAs
   - **Total samples:** 13,562
   - **Families:**
     - bigviktor: 2,001 samples
     - ngioweb: 2,001 samples
     - pizd: 9,560 samples
   - **Format:** Compressed `.gz` files
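For a quick sanity check against the files above, the sketch below reads `datasets/train_wl.csv` and scores a random sample with the expert model. It assumes the CSV has a `domain,family,label` header row (as the `csv.DictReader` usage in the Quick Start suggests) and integer labels with 1 = DGA and 0 = benign; verify both against the actual file. It also requires `scikit-learn` for the F1 computation.

```python
import csv
import random
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "Reynier/moe-wordlist-dga-models"
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    repo, subfolder="models/modernbert-wordlist-expert"
).eval()

# Read the balanced training split; assumes a header row with
# domain,family,label columns and integer labels (0 = benign, 1 = DGA).
with open("datasets/train_wl.csv", newline="") as f:
    rows = list(csv.DictReader(f))

sample = random.sample(rows, 200)          # small random slice for a sanity check
domains = [r["domain"] for r in sample]
labels = [int(r["label"]) for r in sample]

preds = []
with torch.no_grad():
    for start in range(0, len(domains), 32):
        batch = tokenizer(domains[start:start + 32], return_tensors="pt",
                          padding=True, truncation=True, max_length=128)
        preds.extend(model(**batch).logits.argmax(dim=-1).tolist())

print(f"F1 on this sample: {f1_score(labels, preds):.3f}")
```

For the metrics reported in the paper, evaluate on the full `test-known/` and `test-generalization/` sets rather than this training-split sample.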
### πŸ““ Notebooks

Training and evaluation notebooks in `notebooks/`:

- `ModernBERT_base_DGA_Word.ipynb` - Train ModernBERT expert (8 families)
- `ModernBERT_base_DGA_54F.ipynb` - Train ModernBERT generalist (54 families)
- `Train_Gemma3_4B_DGA_WordList.ipynb` - Fine-tune Gemma with LoRA
- `Train_llama3B_DGA_WordList.ipynb` - Fine-tune LLaMA with LoRA
- `Test_Gemma3_4B_DGA_Last.ipynb` - Evaluate Gemma model
- `Test__llama3B_DGA.ipynb` - Evaluate LLaMA model
- `DomUrlBert.ipynb` - Train DomBertUrl model
- `CNN_Patron_WL.ipynb` - Train CNN model
- `FANCI.ipynb` - Train FANCI Random Forest
- `Labin_wl.ipynb` - Train LABin model

---

## πŸ”¬ Research Methodology

### Two-Phase Evaluation Protocol

1. **Phase 1: Known Families Performance**
   - Evaluate on 8 training families
   - 30 test batches × 100 domains per family
   - Measures detection accuracy on familiar variants

2. **Phase 2: Generalization Capability**
   - Evaluate on 3 unseen wordlist-based families
   - Tests robustness against novel DGA variants
   - Critical for real-world deployment

### Evaluation Metrics

- **Precision:** Accuracy of DGA predictions
- **Recall:** Coverage of actual DGAs
- **F1-Score:** Harmonic mean (primary metric)
- **False Positive Rate (FPR):** Benign misclassification rate
- **Inference Time:** Real-world performance (Tesla T4 GPU)

### DGA Families

**Training Families (8 wordlist-based):**
- charbot, deception, gozi, manuelita, matsnu, nymaim, rovnix, suppobox

**Generalization Test (3 wordlist-based):**
- bigviktor, ngioweb, pizd

---

## πŸ› ️ Installation & Requirements

### Basic Requirements

```bash
pip install torch transformers safetensors
```

### For LLM Models (Gemma/LLaMA)

```bash
pip install peft accelerate bitsandbytes
```

### For Traditional Models (FANCI)

```bash
pip install scikit-learn joblib
```

### For LABin Model

```bash
pip install tensorflow keras
```

### GPU Recommendations

- **Optimal:** NVIDIA Tesla T4 or better
- **Minimum:** 8 GB VRAM for ModernBERT
- **LLMs:** 16 GB+ VRAM (or use 8-bit quantization)

---

## πŸ“Š Reproducibility

All experiments are fully reproducible:

1. **Download datasets** from the `datasets/` folder
2. **Run training notebooks** from `notebooks/`
3. **Load pre-trained models** from `models/`
4. **Verify reported metrics** using the test sets

### Expected Results (± std)

| Model | Known F1 | Unknown F1 |
|-------|----------|------------|
| ModernBERT | 86.7% ± 3.0% | 80.9% ± 4.5% |
| Generalist | 79.2% ± 3.5% | 62.1% ± 5.2% |

---

## πŸ“– Citation

This work is currently under review. Preliminary citation:

```bibtex
@article{leyva2025expert,
  title={Expert Selection for Wordlist-Based DGA Detection: A Systematic Evaluation},
  author={Leyva La O, Reynier and Catania, Carlos A. and Gonzalez, Rodrigo},
  journal={Under Review},
  year={2025}
}
```

---

## πŸ”— Links

- **GitHub Repository:** [MoE-word-list-dga-detection](https://github.com/reypapin/MoE-word-list-dga-detection)
- **Paper:** Under Review
- **Contact:** rleyvalao@mendoza-conicet.gob.ar

---

## πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

---

## πŸ™ Acknowledgments

- **Datasets:** DGArchive, 360 Netlab, UMUDga, Tranco
- **Base Models:** ModernBERT (Answer.AI), Gemma (Google), LLaMA (Meta)
- **Infrastructure:** CONICET Argentina

---

## πŸ” Quick Navigation

- [Models Directory](./models/)
- [Datasets Directory](./datasets/)
- [Training Notebooks](./notebooks/)
- [Results & Metrics](https://github.com/reypapin/MoE-word-list-dga-detection/tree/main/Result_csv)

---

**Last Updated:** October 2025