Reynier committed · Commit be28b22 · verified · 1 Parent(s): ad38be7

Add README.md

Files changed (1): models/labin/README.md (+225 -0)
models/labin/README.md ADDED

# LABin Model - Neural Network Approach

This directory contains the LABin (LABoratory of INformatics) neural network model, which represents a deep learning approach to wordlist-based DGA detection using custom neural architectures.

## Model Overview

LABin implements a specialized neural network architecture designed for domain name analysis. This model explores the effectiveness of custom deep learning approaches compared to transformer-based methods and traditional machine learning baselines.

## Model Summary

### Architecture Details
- **Type**: Custom neural network for domain classification
- **Framework**: TensorFlow/Keras implementation
- **Training**: Specialized on wordlist-based DGA patterns
- **Focus**: Learning hierarchical domain name representations

## Files Included

### Model Files (missing; see `LARGE_FILES.md`)
- `LABin_best_model_2025-05-30_15_26_47.keras`: Trained LABin model (~200MB)

### Training Information
- **Training Date**: May 30, 2025, 15:26:47
- **Format**: Keras model format (`.keras`)
- **Architecture**: Custom neural network layers

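Once the checkpoint has been obtained via `LARGE_FILES.md`, it can be loaded and inspected directly. A minimal sketch, assuming the file has been placed in this directory:

```python
from tensorflow import keras

# Load the released checkpoint (path assumes it sits next to this README).
model = keras.models.load_model('LABin_best_model_2025-05-30_15_26_47.keras')

# Print the layer-by-layer architecture and parameter counts.
model.summary()
```
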
## LABin Architecture

### Neural Network Design
The LABin model employs a custom architecture optimized for domain name pattern recognition:

#### Input Processing
- **Domain Tokenization**: Character-level or subword tokenization
- **Sequence Encoding**: Fixed-length sequence representation
- **Embedding Layer**: Learned character/token embeddings
- **Positional Encoding**: Position-aware representations

#### Core Architecture
- **Hidden Layers**: Multiple fully connected or recurrent layers
- **Activation Functions**: ReLU, sigmoid, or custom activations
- **Regularization**: Dropout, batch normalization
- **Feature Extraction**: Hierarchical pattern learning

#### Output Layer
- **Classification Head**: Binary or multi-class classification
- **Output Activation**: Softmax for probability distribution
- **Loss Function**: Cross-entropy or custom loss

### Training Strategy
- **Optimization**: Adam or custom optimizer
- **Learning Rate**: Adaptive learning rate scheduling
- **Batch Size**: Optimized for available hardware
- **Regularization**: Early stopping and dropout (see the callback sketch below)

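The strategy above maps directly onto standard Keras callbacks. A minimal sketch; the monitored metric, patience values, and checkpoint filename are illustrative assumptions, not settings recovered from the actual training run:

```python
from tensorflow import keras

callbacks = [
    # Early stopping: halt when validation loss stops improving.
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                  restore_best_weights=True),
    # Adaptive learning rate: halve the LR when progress stalls.
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                      patience=2, min_lr=1e-6),
    # Keep only the best checkpoint, matching the *_best_model_* naming.
    keras.callbacks.ModelCheckpoint('LABin_best_model.keras',
                                    monitor='val_loss', save_best_only=True),
]

# history = model.fit(X_train, y_train, validation_split=0.2,
#                     batch_size=128, epochs=100, callbacks=callbacks)
```
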
## Usage Example

```python
import numpy as np
from tensorflow import keras

# Note: Model file requires access to large files (see LARGE_FILES.md)
# model = keras.models.load_model('models/labin/LABin_best_model_2025-05-30_15_26_47.keras')

# Example model architecture (representative)
def create_labin_model(vocab_size=1000, max_length=100, embedding_dim=128):
    """Create a LABin-style model architecture"""

    model = keras.Sequential([
        # Input and embedding layers
        keras.layers.Input(shape=(max_length,)),
        keras.layers.Embedding(vocab_size, embedding_dim),

        # Feature extraction layers
        keras.layers.LSTM(64, return_sequences=True),
        keras.layers.Dropout(0.3),
        keras.layers.LSTM(32),
        keras.layers.Dropout(0.3),

        # Classification layers
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(2, activation='softmax')  # Two-class output: benign vs. DGA
    ])

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy',
                 keras.metrics.Precision(name='precision'),
                 keras.metrics.Recall(name='recall')]
    )

    return model

# Example preprocessing
def preprocess_domain(domain, max_length=100):
    """Preprocess a domain for a LABin-style model"""
    # Character-level tokenization
    chars = list(domain.lower())

    # Character-to-index mapping: a-z -> 1-26, plus common domain symbols.
    # Index 0 doubles as padding and unknown (e.g., digits) in this sketch.
    char_to_idx = {chr(i): i - 96 for i in range(97, 123)}  # a-z
    char_to_idx.update({'.': 27, '-': 28, '_': 29})

    # Convert to indices
    indices = [char_to_idx.get(c, 0) for c in chars]

    # Pad or truncate to max_length
    if len(indices) < max_length:
        indices.extend([0] * (max_length - len(indices)))
    else:
        indices = indices[:max_length]

    return np.array(indices)

# Example usage
domain_example = "secure-banking-portal.com"
processed_domain = preprocess_domain(domain_example)
print(f"Processed domain shape: {processed_domain.shape}")

# Model prediction example (requires loaded model)
# prediction = model.predict(processed_domain.reshape(1, -1))
# is_dga = prediction[0][1] > 0.5
# confidence = prediction[0][1]
```

## Training Details

### Dataset Configuration
- **Training Size**: 160,000 domains from wordlist-based DGA families
- **Validation Split**: 20% held out for validation (see the split sketch below)
- **Test Set**: Separate unseen families for generalization testing
- **Class Balance**: Balanced representation across DGA families

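A minimal sketch of how such a split can be produced with scikit-learn; the arrays `domains` and `labels` are placeholders standing in for the encoded dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 160,000 encoded domains and their binary labels.
domains = np.zeros((160_000, 100), dtype=np.int32)
labels = np.random.randint(0, 2, size=160_000)

# 20% held out for validation; stratify keeps both splits class-balanced.
X_train, X_val, y_train, y_val = train_test_split(
    domains, labels, test_size=0.2, stratify=labels, random_state=42
)
print(X_train.shape, X_val.shape)  # (128000, 100) (32000, 100)
```
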
### Training Process
1. **Data Preprocessing**: Domain tokenization and sequence encoding
2. **Model Architecture**: Custom neural network design
3. **Training Loop**: Iterative optimization with validation monitoring
4. **Model Selection**: Best checkpoint based on validation performance
5. **Evaluation**: Testing on unseen DGA families

### Training Configuration
- **Epochs**: Variable with early stopping
- **Batch Size**: Optimized for memory and convergence
- **Learning Rate**: Adaptive scheduling
- **Regularization**: Dropout and early stopping

## Performance Characteristics

### Strengths
- **Pattern Learning**: Automatic feature extraction from raw domains
- **Flexibility**: Customizable architecture for specific requirements
- **Neural Representations**: Rich learned domain representations
- **End-to-end Training**: Direct optimization for DGA detection

### Limitations
- **Computational Requirements**: More expensive than traditional ML
- **Training Complexity**: Requires neural network expertise
- **Hyperparameter Sensitivity**: Performance depends on architecture choices
- **Generalization**: May overfit to training families

## Research Context

### Role in Expert Evaluation
LABin serves as a **custom neural network baseline** in our comparative study:

1. **Custom Architecture**: Alternative to transformer-based approaches
2. **Domain-specific Design**: Neural network tailored for domain analysis
3. **Performance Comparison**: Effectiveness vs. transformers and traditional ML
4. **Computational Analysis**: Resource requirements compared to other approaches

### Comparison Insights
- **vs. ModernBERT**: Custom architecture vs. pre-trained transformers
- **vs. Traditional ML**: Neural learning vs. engineered features
- **vs. LLMs**: Specialized design vs. general-purpose models

## Integration Scenarios

### Deployment Options
- **Standalone System**: Independent DGA detection service
- **MoE Component**: Specialized expert in a mixture-of-experts system
- **Ensemble Member**: Part of a larger ensemble (see the sketch below)
- **Research Baseline**: Custom neural network reference

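As one illustration of the ensemble option, the sketch below averages LABin's softmax DGA probability with scores from other experts. The expert scores and weighting scheme are hypothetical, not part of the released system:

```python
import numpy as np

def ensemble_dga_score(expert_scores, weights=None):
    """Weighted average of per-expert P(DGA) values for one domain.

    expert_scores: one probability per expert (e.g., LABin, a transformer,
    a traditional ML model). Uniform weights are used if none are given.
    """
    scores = np.asarray(expert_scores, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(scores, weights))

# Example: LABin is confident (0.92) while two hypothetical experts are not.
combined = ensemble_dga_score([0.92, 0.35, 0.60])
print(f"Ensemble P(DGA) = {combined:.2f}")  # 0.62 with uniform weights
```
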
### Computational Requirements
- **Training**: GPU recommended for efficient training
- **Inference**: CPU/GPU flexible depending on throughput needs
- **Memory**: Moderate memory requirements
- **Latency**: Faster than transformers, slower than traditional ML

## Model Versioning

### File Naming Convention
- **Format**: `LABin_best_model_YYYY-MM-DD_HH_MM_SS.keras`
- **Current Version**: `LABin_best_model_2025-05-30_15_26_47.keras`
- **Timestamp**: Training completion time (parsed in the sketch below)
- **Selection**: Best performing checkpoint during training

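A small sketch of recovering the timestamp from a filename that follows this convention; the helper name is ours, not part of the repository:

```python
from datetime import datetime

def parse_checkpoint_timestamp(filename):
    """Extract the training-completion time from a LABin checkpoint name."""
    # Strip the fixed prefix/suffix, leaving e.g. '2025-05-30_15_26_47'.
    stamp = filename.removeprefix('LABin_best_model_').removesuffix('.keras')
    return datetime.strptime(stamp, '%Y-%m-%d_%H_%M_%S')

print(parse_checkpoint_timestamp('LABin_best_model_2025-05-30_15_26_47.keras'))
# -> 2025-05-30 15:26:47
```
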
## Citation

This LABin model is part of our expert selection research:

```bibtex
@inproceedings{leyva2025specialized,
  title={Specialized Expert Models for Wordlist-Based DGA Detection: A Mixture of Experts Approach},
  author={Leyva La O, Reynier and Gonzalez, Rodrigo and Catania, Carlos A.},
  booktitle={CACIC 2025},
  year={2025}
}
```

## Development Notes

### Future Improvements
- **Architecture Optimization**: Explore different neural architectures
- **Attention Mechanisms**: Incorporate attention for better interpretability
- **Multi-task Learning**: Joint training on multiple DGA-related tasks
- **Transfer Learning**: Pre-training on larger domain corpora

### Known Issues
- **Model Size**: Large file size limits distribution
- **Training Time**: Longer training compared to traditional ML
- **Hyperparameter Tuning**: Requires extensive experimentation
- **Interpretability**: Limited compared to feature-based approaches

**Note**: The complete model file is excluded due to GitHub size limitations. See `LARGE_FILES.md` for access instructions.