---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- security
- log-analysis
- threat-detection
- nginx
- text-classification
- lora
- cpu
- llama-cpp
language:
- en
library_name: transformers
pipeline_tag: text-classification
datasets:
- nginx_security
metrics:
- accuracy
model-index:
- name: SecInt-SmolLM2-360M-nginx
  results:
  - task:
      type: text-classification
      name: Security Log Classification
    metrics:
    - type: accuracy
      value: 99.0
      name: Accuracy
---

# SecInt-SmolLM2-360M-nginx

**SecInt** (Security Intelligence Monitor) is a fine-tuned SmolLM2-360M model for real-time nginx security log classification. This is the first model in the SecInt series, designed to automatically classify web server log entries as security threats, errors, or normal traffic.

**Note**: This repository contains two GGUF models. Try version 04; it has been trained on considerably more data.

## Model Overview

- **Base Model**: [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
- **Model Size**: 360M parameters (~691MB)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Multi-class text classification (3 classes)
- **Classes**: `hack`, `error`, `normal`
- **Inference**: CPU-optimized (~2GB RAM, 32 tokens/sec)
- **Format**: Safetensors + GGUF (llama.cpp compatible)

## Key Features

- **99%+ Accuracy** on production security logs
- **Real-time Detection**: ~100ms latency per classification
- **CPU Inference**: No GPU required, runs on any system
- **Production-Tested**: Battle-tested since October 2025, processing logs from 8 domains
- **Lightweight**: Only ~2GB RAM needed
- **Fast**: 32 tokens/second on CPU

## Quick Start

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "LeviDeHaan/SecInt-SmolLM2-360M-nginx"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example log entry
log_entry = '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'

# System prompt with classification rules
system_prompt = """You are a security log analyzer. Classify the log entry as one of: hack, error, or normal.

HACK - Any of these patterns indicate an attack:
- Scanning for sensitive files: .env, .git, .php, config.php, wp-admin, phpmyadmin
- SQL injection attempts, XSS attempts
- Invalid login attempts, brute force, "invalid user", "failed password"
- Exploit attempts: /cgi-bin/, shell commands, malformed requests
- 403/404 errors with suspicious paths
- "access forbidden by rule" with .env, .git, admin, wp-, .php
- Scanner user-agents: sqlmap, nikto, zgrab, nuclei
- Webshell access attempts

ERROR - Application errors:
- 500 errors, crashes, exceptions
- SSL/TLS errors
- Database connection failures
- [emerg], [alert], [crit], [error] log levels

NORMAL - Everything else:
- 200/304 responses to legitimate paths
- Regular API calls, static files
- Known good bots: googlebot, facebookbot

Respond with only one word: hack, error, or normal."""

# Format prompt using chat template
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Classify this log entry as hack, error, or normal.\n\n{log_entry}"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate classification
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=10,
        temperature=0.01,
        top_p=0.38,
        top_k=10,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Extract result
result = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).strip()
print(f"Classification: {result}")  # Output: hack
```

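Because the model generates free text, downstream code may want to map the decoded string onto one of the three class names before acting on it. The following is a minimal sketch of that step; the helper name `parse_label` and the `"unknown"` fallback are assumptions for illustration and are not part of the released model or the SecInt system.

```python
import re

# The three classes this model card documents.
LABELS = ("hack", "error", "normal")

def parse_label(raw: str, default: str = "unknown") -> str:
    """Return the first known label found in the model output, else `default`.

    `default` is a hypothetical fallback; callers might prefer to route
    unrecognized output to manual review instead.
    """
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", raw.lower())
    return match.group(1) if match else default
```

In the Quick Start code, `parse_label(result)` could replace the direct use of `result` so that stray whitespace, casing, or trailing punctuation does not leak into alerting logic.
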
### Using llama.cpp

The repository includes GGUF files for efficient CPU inference:

```bash
# Download the GGUF model into the current directory
huggingface-cli download LeviDeHaan/SecInt-SmolLM2-360M-nginx smollm-security-nginx02-merged.gguf --local-dir .

# Run inference with llama.cpp
./llama-cli -m smollm-security-nginx02-merged.gguf \
  --temp 0.01 \
  --top-p 0.38 \
  --top-k 10 \
  --seed 42 \
  -p "<|im_start|>system\nYou are a security log analyzer...<|im_end|>\n<|im_start|>user\nClassify this log entry...<|im_end|>\n<|im_start|>assistant\n"
```

## Training Details

### Dataset

- **Source**: Real production nginx logs from 8 domains
- **Total Examples**: 1,646 labeled samples
- **Class Distribution**:
  - `hack`: 800 examples (48.6%) - SQL injection, path traversal, scanner activity, exploit attempts
  - `error`: 46 examples (2.8%) - 500 errors, SSL failures, application crashes
  - `normal`: 800 examples (48.6%) - Legitimate traffic, API calls, static file requests

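The percentages above follow directly from the class counts. The short sketch below recomputes them and also shows how small the `error` class is compared with the other two. This matters when reading a single overall accuracy figure. The variable names are illustrative only.

```python
# Class counts as listed in the Dataset section.
counts = {"hack": 800, "error": 46, "normal": 800}

total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# How many examples the largest class has per example of the smallest.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total, shares, round(imbalance_ratio, 1))
```

With roughly 17 `hack` or `normal` examples for every `error` example, per-class metrics would be a useful complement to overall accuracy when evaluating on your own logs.
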
### LoRA Configuration

```yaml
LoRA Rank (r): 8
LoRA Alpha: 16
LoRA Dropout: 0.05
Target Modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
RSLoRA: enabled
```

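For readers unfamiliar with rank-stabilized LoRA, the practical difference is the factor applied to the adapter update: standard LoRA scales it by `alpha / r`, while RSLoRA scales it by `alpha / sqrt(r)`. The sketch below simply evaluates both for the rank and alpha listed above. It illustrates the general technique and does not come from the training code.

```python
import math

r = 8        # LoRA rank from the configuration above
alpha = 16   # LoRA alpha from the configuration above

standard_scale = alpha / r            # classic LoRA scaling
rslora_scale = alpha / math.sqrt(r)   # rank-stabilized LoRA scaling

print(f"standard: {standard_scale:.3f}, rslora: {rslora_scale:.3f}")
```

With these values the rank-stabilized factor is noticeably larger than the standard one, so the adapter update carries more weight for the same `r` and `alpha`.
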
### Training Hyperparameters

```yaml
Learning Rate: 2e-05
Scheduler: cosine_with_restarts
Warmup Steps: 5
Batch Size: 10 per device
Gradient Accumulation: 8 steps
Effective Batch Size: 80
Epochs: 10
Max Sequence Length: 2048 tokens
Optimizer: AdamW (betas=0.9,0.999, eps=1e-08)
Seed: 42
```

### Training Results

- **Training Duration**: ~50 minutes (210 steps)
- **Final Loss**: 0.2575
- **Throughput**: 3,121 tokens/second
- **Total Tokens**: 9.29M
- **Hardware**: CPU training (no GPU required)

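The reported step count and duration are consistent with the hyperparameters, which the following back-of-the-envelope check illustrates. It assumes a single training device, that all 1,646 examples were used for training, and that a final partial batch counts as one optimizer step. None of these assumptions is stated explicitly in this card.

```python
import math

examples = 1646            # total labeled samples
per_device_batch = 10
grad_accum = 8
epochs = 10
total_tokens = 9.29e6      # reported total tokens
tokens_per_second = 3121   # reported throughput

effective_batch = per_device_batch * grad_accum          # assumes one device
steps_per_epoch = math.ceil(examples / effective_batch)  # partial batch counts
total_steps = steps_per_epoch * epochs
minutes = total_tokens / tokens_per_second / 60

print(effective_batch, steps_per_epoch, total_steps, round(minutes, 1))
```

Under those assumptions the arithmetic reproduces the 210 steps and the roughly 50 minute duration listed above.
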
## Use Cases

### Real-time Web Server Security Monitoring

SecInt is designed for integration into security monitoring systems to provide automated threat detection:

1. **Log Ingestion**: Monitor nginx access/error logs
2. **Classification**: Identify attacks, errors, and normal traffic
3. **Alerting**: Trigger notifications for security threats
4. **Analytics**: Track attack patterns and trends
5. **Response**: Feed into incident response workflows

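For the ingestion step, it can help to split each access log line into fields before or alongside classification, for example to group alerts by client IP. The sketch below parses nginx's default `combined` access log format with a regular expression. The function name `parse_access_line` is hypothetical, and the pattern assumes the stock format rather than a custom `log_format`.

```python
import re
from typing import Optional

# Pattern for nginx's default "combined" access log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_access_line(line: str) -> Optional[dict]:
    """Return the fields of one access log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields

entry = parse_access_line(
    '192.168.1.100 - - [28/Oct/2025:12:34:56 +0000] '
    '"GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"'
)
print(entry)
```

Lines that do not match, such as error log entries, return `None` and can still be passed to the classifier unchanged.
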
### Typical Integration Architecture

```
nginx logs → Log Parser → SecInt Classifier → Alert System
                                 ↓
                         Database Storage → Dashboard
```

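The data flow in the diagram can be sketched as a small loop that takes the classifier, storage, and alerting steps as plain callables. This is only an illustration of the shape of such an integration. `run_pipeline` and `stub_classify` are hypothetical names, and the stub stands in for a real call to the model.

```python
from typing import Callable, Dict, Iterable

def run_pipeline(
    lines: Iterable[str],
    classify: Callable[[str], str],
    store: Callable[[str, str], None],
    alert: Callable[[str], None],
) -> Dict[str, int]:
    """Classify each non-empty line, store every result, alert on 'hack'."""
    counts: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        label = classify(line)
        counts[label] = counts.get(label, 0) + 1
        store(line, label)
        if label == "hack":
            alert(line)
    return counts

def stub_classify(line: str) -> str:
    """Placeholder for the model call; flags only requests for /.env."""
    return "hack" if "/.env" in line else "normal"

stored, alerts = [], []
sample = [
    '203.0.113.5 - - [28/Oct/2025:12:34:56 +0000] "GET /.env HTTP/1.1" 404 162 "-" "curl/7.68.0"',
    "",
    '203.0.113.9 - - [28/Oct/2025:12:35:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
counts = run_pipeline(sample, stub_classify, lambda l, lab: stored.append((l, lab)), alerts.append)
print(counts)
```

Keeping the classifier behind a callable makes it straightforward to swap the stub for the Transformers or llama.cpp inference code shown in Quick Start.
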
### Detection Capabilities

The model can identify:

**Attack Patterns (hack)**:
- File/directory scanning (`.env`, `.git`, `config.php`, `wp-admin`, `phpmyadmin`)
- SQL injection (`UNION SELECT`, `OR 1=1`, etc.)
- Cross-site scripting (XSS) attempts
- Path traversal (`../../../`)
- Command injection attempts
- Known exploit attempts (PHPUnit RCE, ThinkPHP, etc.)
- Webshell access (c99, r57, alfa, wso)
- Scanner signatures (sqlmap, nikto, zgrab, nuclei)
- Brute force attacks (failed passwords, invalid users)
- Request obfuscation (null bytes, encoding tricks)

**Application Errors (error)**:
- HTTP 500 errors
- SSL/TLS handshake failures
- Application crashes and exceptions
- Database connection errors
- Critical log levels ([emerg], [alert], [crit])

**Normal Traffic (normal)**:
- HTTP 200/304 responses to legitimate paths
- API endpoints and authenticated requests
- Static file serving (CSS, JS, images)
- Known good bots (Googlebot, etc.)

## Performance Metrics

### Optimization Features

When deployed in the full SecInt system:

- **Intelligent Caching**: 95%+ cache hit rate reduces redundant LLM calls
- **Session Tracking**: Sampling mode after 50 requests from same IP
- **Whitelist Support**: Known-good traffic bypasses classification
- **Batch Processing**: Groups requests for efficient processing

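The full SecInt implementation of these optimizations is not shown in this card. The following is a rough sketch of how caching and per-IP sampling could work. The cache key (the log line with client IP and timestamp masked), the unbounded dictionary cache, and the `sample_every=10` rate are all assumptions chosen for illustration. Only the 50-request threshold comes from the list above.

```python
import re
from collections import Counter
from typing import Callable, Dict

def cache_key(line: str) -> str:
    """Mask client IP and timestamp so identical requests share one key."""
    line = re.sub(r"^\S+", "<ip>", line)
    return re.sub(r"\[[^\]]+\]", "<time>", line, count=1)

def make_cached_classifier(classify: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a classifier so each distinct request shape is classified once."""
    cache: Dict[str, str] = {}  # unbounded here; a real system would cap it

    def cached(line: str) -> str:
        key = cache_key(line)
        if key not in cache:
            cache[key] = classify(line)  # the model sees the original line
        return cache[key]

    return cached

class SessionSampler:
    """Classify every request up to `threshold` per IP, then only a sample."""

    def __init__(self, threshold: int = 50, sample_every: int = 10) -> None:
        self.threshold = threshold
        self.sample_every = sample_every  # hypothetical sampling rate
        self.seen: Counter = Counter()

    def should_classify(self, ip: str) -> bool:
        self.seen[ip] += 1
        n = self.seen[ip]
        if n <= self.threshold:
            return True
        return (n - self.threshold) % self.sample_every == 0
```

Masking the IP means two clients requesting the same path with the same user agent share a cached verdict. That trades some per-client nuance for far fewer model calls.
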
## Recommended Inference Settings

For optimal security classification results:

```python
temperature = 0.01   # Very deterministic
max_tokens = 1024    # Upper bound only; the expected output is a single word
top_k = 10           # Limit vocabulary
top_p = 0.38         # Nucleus sampling
seed = 42            # Fixed for consistency
```

These settings keep classification consistent and near-deterministic, which suits production security monitoring.

## Prompt Template

The model requires the SmolLM2 chat template format. **Critical**: Use the exact system prompt shown in the Quick Start section for best results. The system prompt contains:

1. Clear task definition
2. Detailed attack pattern definitions (HACK class)
3. Error pattern definitions (ERROR class)
4. Normal traffic definitions (NORMAL class)
5. Instruction to respond with single word only

Deviation from this prompt format may significantly reduce accuracy.

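When a runtime does not apply the chat template automatically, the prompt string can be assembled by hand. The sketch below mirrors the `<|im_start|>` / `<|im_end|>` layout used in the llama.cpp command in this card. The function name `build_chatml_prompt` is illustrative. With Transformers, `tokenizer.apply_chat_template` remains the safer choice because it reads the template shipped with the tokenizer.

```python
def build_chatml_prompt(system_prompt: str, user_message: str) -> str:
    """Assemble a system + user turn and open the assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a security log analyzer...",  # use the full Quick Start prompt
    "Classify this log entry as hack, error, or normal.\n\n<log entry here>",
)
print(prompt)
```

The trailing `<|im_start|>assistant\n` leaves the assistant turn open so the model's first generated tokens are the class label.
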
## Limitations

- **nginx-Specific**: Trained exclusively on nginx log format; may require fine-tuning for Apache, IIS, or other web servers
- **Prompt-Dependent**: Requires exact prompt template for optimal performance
- **CPU Inference**: Optimized for CPU; no GPU-specific optimizations
- **English Only**: Trained on English-language logs
- **Context Length**: Limited to 2048 tokens per request, which must cover the system prompt as well as the log entry
- **No Multi-log Context**: Classifies individual log entries; does not correlate across multiple logs

## Model Architecture

Built on SmolLM2-360M-Instruct, a decoder-only transformer model optimized for instruction following:

- **Parameters**: 360M
- **Architecture**: Transformer decoder with grouped-query attention
- **Context Length**: 2048 tokens (the maximum sequence length used for fine-tuning)
- **Vocabulary Size**: 49,152 tokens
- **Base Training**: Pre-trained on diverse text corpus, instruction-tuned

LoRA fine-tuning targets all attention and MLP projection layers for maximum adaptation to security log classification while maintaining base model knowledge.

## Citation

If you use this model in your research or production systems, please cite:

```bibtex
@misc{secint-smollm2-nginx,
  author = {Levi DeHaan},
  title = {SecInt: SmolLM2-360M Fine-tuned for nginx Security Log Classification},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx}}
}
```

## Acknowledgments

- **Hugging Face** for the SmolLM2-360M-Instruct base model
- **llama.cpp** team for efficient CPU inference capabilities
- **LLaMA-Factory** for its streamlined LoRA fine-tuning framework

## License

This model is released under the Apache 2.0 license, consistent with the base SmolLM2 model. You are free to use, modify, and distribute this model for commercial and non-commercial purposes.

## Project

SecInt is part of the **Security Intelligence Monitor v2** project, a comprehensive real-time security monitoring system for web servers. The full system includes:

- Multi-format log ingestion (nginx, Apache, custom)
- AI-powered threat classification
- Threat intelligence enrichment (GeoIP, Shodan)
- Breach detection (7+ detection rules)
- Real-time alerting (Pushover, email, webhooks)
- Interactive dashboard (Streamlit)
- Attack session management
- SQLite-based persistence and analytics

For more information about the full SecInt system, visit: [logwatcher project](https://levidehaan.com/projects)

## Model Card Contact

For questions, issues, or collaboration opportunities:

- **Hugging Face**: [@LeviDeHaan](https://huggingface.co/LeviDeHaan)
- **Model Repository**: [SecInt-SmolLM2-360M-nginx](https://huggingface.co/LeviDeHaan/SecInt-SmolLM2-360M-nginx)