AetherMind-KD-Student
A Robust and Efficient Knowledge-Distilled Model for Natural Language Inference (NLI)
Repository: samerzaher80/AetherMind-KD-Student
License: MIT
Overview
AetherMind-KD-Student is a 184M-parameter Natural Language Inference (NLI) model distilled from a DeBERTa-v3 teacher using a multi-stage, adversarial-aware knowledge distillation pipeline.
The model is designed to provide:
- High accuracy on standard NLI benchmarks
- Strong robustness on adversarial datasets
- Excellent zero-shot generalization to unseen datasets
- High inference efficiency on consumer GPUs
This makes it suitable for research and practical applications that require fast and reliable sentence-level reasoning.
Key Features
Knowledge Distillation from Large DeBERTa-v3 Teachers
- Teacher: DeBERTa-v3-based NLI model
- Student: 184M-parameter transformer
- Combined objective:
- 70% KLDivLoss on teacher soft logits
- 30% CrossEntropyLoss on gold labels
- Temperature scaling (T ≈ 3.0) for softened targets
Multi-Stage Curriculum
Teacher supervision was applied over a curriculum of NLI datasets:
- SNLI → core NLI patterns
- MNLI → multi-domain robustness
- ANLI R1–R3 → adversarial reasoning
Training Enhancements
- BalancedBatchSampler to keep entailment/neutral/contradiction distributions balanced per batch (see the sketch after this list)
- Emphasis on contradiction and neutral classes via loss weighting and sampling
- Careful scheduling and early stopping based on validation performance
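As an illustration of the balanced sampling idea, a minimal `BalancedBatchSampler` could look like the sketch below. The class name follows the description above, but the constructor arguments and internals are assumptions, not the released training code.

```python
import random
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yield batches with a roughly equal number of examples per NLI label.

    Illustrative sketch: the actual training code may differ in shuffling
    strategy and in how leftover examples are handled.
    """

    def __init__(self, labels, batch_size):
        assert batch_size % 3 == 0, "batch size should be divisible by the 3 NLI classes"
        self.labels = labels
        self.per_class = batch_size // 3
        # Bucket dataset indices by label id (0 = entailment, 1 = neutral, 2 = contradiction).
        self.by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in (0, 1, 2)}

    def __iter__(self):
        # Reshuffle each class pool every epoch.
        pools = {c: random.sample(idxs, len(idxs)) for c, idxs in self.by_class.items()}
        num_batches = min(len(p) for p in pools.values()) // self.per_class
        for b in range(num_batches):
            batch = []
            for c in (0, 1, 2):
                start = b * self.per_class
                batch.extend(pools[c][start:start + self.per_class])
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return min(len(v) for v in self.by_class.values()) // self.per_class
```

Such a sampler would typically be passed to a PyTorch `DataLoader` via its `batch_sampler` argument, e.g. `DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, 48))`.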
Datasets
Used During Training / Distillation
| Dataset | Role |
|---|---|
| SNLI | Base NLI training (entailment, neutral, contradiction) |
| MNLI | Multi-genre generalization (matched + mismatched) |
| ANLI (R1–R3) | Adversarial robustness and hard examples |
Not Used in Training (Zero-Shot Evaluation Only)
The following datasets were not used during training or distillation. All results on them are pure zero-shot:
| Dataset | Type | Notes |
|---|---|---|
| RTE (GLUE) | Textual entailment | Zero-shot generalization |
| HANS | Heuristic / syntactic bias test | Zero-shot |
| SciTail | Science-domain entailment | Evaluated in binary setting |
| XNLI (English) | Cross-lingual NLI test | Zero-shot on English split |
Model Architecture
The model follows a compact transformer architecture:
- 12 Transformer encoder layers
- Hidden size: 768
- 12 attention heads
- Intermediate feed-forward size as in BERT/DeBERTa-base-style models
- Final classification head with 3 output logits:
- 0 = entailment
- 1 = neutral
- 2 = contradiction
Total parameters: 184,424,451
The design target is to match or exceed the performance of larger teacher models while remaining efficient enough for real-time inference on a single consumer GPU.
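The reported parameter count can be checked directly against the published checkpoint:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("samerzaher80/AetherMind-KD-Student")
# Should print 184,424,451 according to the figure stated above.
print(sum(p.numel() for p in model.parameters()))
```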
Knowledge Distillation Strategy
Objective
The total loss is a weighted combination:
- Knowledge Distillation Loss (KLDivLoss)
- Encourages student logits to match the teacher's softened output distribution
- Supervised Loss (CrossEntropy)
- Encourages correct prediction of the gold label
Formally:
L_total = 0.7 · L_KD + 0.3 · L_CE
where L_KD uses temperature-scaled teacher logits.
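As a concrete illustration, a minimal PyTorch sketch of this objective might look as follows. The 0.7/0.3 weights and T ≈ 3.0 come from the description above; the `batchmean` reduction and the T² rescaling of the KD term are conventional assumptions, not confirmed details of the training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_labels,
                      temperature=3.0, kd_weight=0.7, ce_weight=0.3):
    """Combine soft-target KL distillation with hard-label cross-entropy."""
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 rescaling keeps the KD gradient magnitude comparable to CE
    # (the usual Hinton-style convention; assumed here, not stated in the card).
    kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, gold_labels)
    return kd_weight * kd + ce_weight * ce
```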
Additional Techniques
- Balanced batches w.r.t. class labels
- Emphasis on contradiction / neutral examples during later stages
- Adversarial samples from ANLI to harden reasoning under distribution shifts
Evaluation Results
1. Core NLI Benchmarks
| Dataset | Split | Accuracy | Macro-F1 |
|---|---|---|---|
| MNLI (matched) | validation | 90.47% | 90.42% |
| MNLI (mismatched) | validation | 90.12% | 90.07% |
| SNLI | test | ~88–89% | ~88–89% |
2. Adversarial NLI (ANLI)
| Dataset | Split | Accuracy | Macro-F1 |
|---|---|---|---|
| ANLI R1 | test_r1 | 73.60% | 73.61% |
| ANLI R2 | test_r2 | 57.70% | 57.60% |
| ANLI R3 | test_r3 | 53.67% | 53.68% |
These scores indicate strong robustness, especially considering the model's size.
3. Zero-Shot Generalization
These datasets were never seen during training. All scores are zero-shot.
RTE (GLUE)
- Accuracy: 86.28%
- Macro-F1: 86.20%
HANS
- Accuracy: 77.74%
- Macro-F1: 76.60%
The strong performance on HANS suggests reduced dependence on shallow lexical heuristics.
SciTail (Binary Setting)
SciTail originally has entailment vs. neutral classes. For evaluation, the model's 3-way logits are mapped as follows (a short sketch is given after the list):
- Entailment → entailment
- Neutral + contradiction → non-entailment
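Concretely, the collapsing step is just a mapping over the argmax prediction; a minimal sketch (the function name is illustrative):

```python
def to_binary_label(pred_id):
    """Collapse the model's 3-way prediction into SciTail's binary scheme."""
    # 0 = entailment stays entailment; 1 = neutral and 2 = contradiction
    # are both treated as non-entailment.
    return "entailment" if pred_id == 0 else "non-entailment"
```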
| Split | Accuracy | Macro-F1 |
|---|---|---|
| Train | 82.37% | 80.99% |
| Dev | 78.83% | 78.81% |
XNLI (English, zero-shot)
- Accuracy: 90.92%
- Macro-F1: 90.94%
This demonstrates strong cross-domain and cross-benchmark generalization, even without explicit multilingual or XNLI-specific training.
Summary of Results
| Task | Dataset | Split | Accuracy | Macro-F1 |
|---|---|---|---|---|
| Natural Language Inference | MNLI (matched) | validation | 90.47% | 90.42% |
| Natural Language Inference | MNLI (mismatched) | validation | 90.12% | 90.07% |
| Natural Language Inference | SNLI | test | ~88–89% | ~88–89% |
| Adversarial NLI | ANLI R1 | test_r1 | 73.60% | 73.61% |
| Adversarial NLI | ANLI R2 | test_r2 | 57.70% | 57.60% |
| Adversarial NLI | ANLI R3 | test_r3 | 53.67% | 53.68% |
| Zero-shot | RTE (GLUE) | validation | 86.28% | 86.20% |
| Zero-shot | HANS | validation | 77.74% | 76.60% |
| Zero-shot (binary) | SciTail | dev | 78.83% | 78.81% |
| Zero-shot | XNLI (English) | test | 90.92% | 90.94% |
Efficiency
| Metric | Value |
|---|---|
| Total parameters | 184,424,451 |
| Inference speed | ≈ 308.51 samples/second |
| Hardware | RTX 3050 (8 GB), CUDA 11.8 |
These numbers make the model a good choice for production environments and large-scale batch inference.
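A comparable throughput measurement could be reproduced along these lines; the batch size, workload, and sequence lengths below are illustrative assumptions, not the exact benchmark script behind the table above.

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Illustrative workload: 1,024 premise/hypothesis pairs, batch size 64.
pairs = [("A cat is sleeping on the sofa.", "An animal is resting indoors.")] * 1024
batch_size = 64

start = time.perf_counter()
with torch.no_grad():
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        enc = tokenizer([p for p, _ in batch], [h for _, h in batch],
                        padding=True, truncation=True, return_tensors="pt").to(device)
        model(**enc)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{len(pairs) / elapsed:.2f} samples/second")
```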
Intended Use
Recommended Uses
- Research on NLI, robustness, and knowledge distillation
- As a drop-in NLI component for:
- Scientific text understanding
- Claim verification prototypes
- General English reasoning tasks
- Zero-shot probing on new NLI-style benchmarks
Not Recommended For
- Safety-critical applications (medical diagnosis, legal decisions, etc.) without human experts in the loop
- High-stakes multilingual use cases (model is trained and validated on English only)
- Long-document reasoning beyond typical transformer context length
Limitations
- Performance on ANLI R3 remains challenging, consistent with broader model behavior in the literature
- No dedicated multilingual training (XNLI non-English languages not evaluated)
- No explicit calibration of probabilities (users may wish to post-calibrate logits/probabilities)
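For users who need calibrated probabilities, standard temperature scaling on held-out validation logits is one option. The sketch below is a generic recipe, not part of the released model.

```python
import torch

def fit_temperature(val_logits, val_labels, lr=0.01, steps=200):
    """Fit a single softmax temperature on held-out logits (standard temperature scaling)."""
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=lr, max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / temperature, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# Calibrated probabilities are then softmax(logits / T) at inference time.
```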
Future Work
Planned and possible future enhancements include:
- Adversarial fine-tuning specifically for ANLI R3
- Cross-lingual extensions using full XNLI
- Domain adapters for biomedical and clinical NLI (e.g., MedNLI)
- Integration into larger cognitive reasoning systems with memory and tool use (outside the scope of this model card)
Files in This Repository
- `config.json` – model configuration
- `model.safetensors` – model weights
- `tokenizer.json` – tokenizer model
- `tokenizer_config.json` – tokenizer configuration
- `special_tokens_map.json` – special tokens metadata
- `spm.model` – SentencePiece model (if applicable)
- `added_tokens.json` – additional tokens (if any)
- `training_args.bin` – training arguments (optional, for reproducibility)
- `trainer_state.json` – trainer state (optional, for reproducibility)
Usage Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting indoors."

# Encode the premise/hypothesis pair and run a forward pass without gradients.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
pred = logits.argmax(dim=-1).item()

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
print(id2label[pred])
```
Citation
If you use this model in your research, please cite:
```bibtex
@misc{aethermind2025kdstudent,
  title        = {AetherMind-KD-Student: A Robust and Efficient Knowledge-Distilled NLI Model},
  author       = {Sameer S. Najm},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  note         = {\url{https://huggingface.co/samerzaher80/AetherMind-KD-Student}}
}
```
Author
Sameer S. Najm
AI Researcher & Founder, Sam IT Solutions โ Iraq
License
This model is released under the MIT License.