AetherMind-KD-Student

A Robust and Efficient Knowledge-Distilled Model for Natural Language Inference (NLI)

Repository: samerzaher80/AetherMind-KD-Student
License: MIT


📘 Overview

AetherMind-KD-Student is a 184M-parameter Natural Language Inference (NLI) model distilled from a DeBERTa-v3 teacher using a multi-stage, adversarial-aware knowledge distillation pipeline.
The model is designed to provide:

  • High accuracy on standard NLI benchmarks
  • Strong robustness on adversarial datasets
  • Excellent zero-shot generalization to unseen datasets
  • High inference efficiency on consumer GPUs

This makes it suitable for research and practical applications that require fast and reliable sentence-level reasoning.


🧠 Key Features

✔ Knowledge Distillation from Large DeBERTa-v3 Teachers

  • Teacher: DeBERTa-v3-based NLI model
  • Student: 184M-parameter transformer
  • Combined objective:
    • 70% KLDivLoss on teacher soft logits
    • 30% CrossEntropyLoss on gold labels
  • Temperature scaling (T ≈ 3.0) for softened targets

✔ Multi-Stage Curriculum

Teacher supervision was applied over a curriculum of NLI datasets (a dataset-loading sketch follows this list):

  1. SNLI – core NLI patterns
  2. MNLI – multi-domain robustness
  3. ANLI R1–R3 – adversarial reasoning
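
A minimal sketch of how such a curriculum can be assembled with the Hugging Face datasets library, assuming the standard dataset IDs (snli, multi_nli, anli); the exact stage composition and filtering used for training may differ:

from datasets import load_dataset, concatenate_datasets

# Stage 1: SNLI (drop examples without a gold label, which are marked -1)
snli = load_dataset("snli", split="train").filter(lambda ex: ex["label"] != -1)

# Stage 2: MNLI for multi-domain coverage
mnli = load_dataset("multi_nli", split="train")

# Stage 3: ANLI rounds R1-R3 for adversarial examples
anli = concatenate_datasets(
    [load_dataset("anli", split=f"train_r{i}") for i in (1, 2, 3)]
)

# Stages are visited in order during distillation
curriculum = [("snli", snli), ("mnli", mnli), ("anli", anli)]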

✔ Training Enhancements

  • BalancedBatchSampler to keep entailment/neutral/contradiction distributions balanced per batch (see the sketch after this list)
  • Emphasis on contradiction and neutral classes via loss weighting and sampling
  • Careful scheduling and early stopping based on validation performance
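
The BalancedBatchSampler itself is not shipped in this repository; a minimal PyTorch sketch of a class-balanced batch sampler, assuming integer labels 0/1/2 and a batch size divisible by the number of classes, might look like this:

import random
from collections import defaultdict
from torch.utils.data import Sampler

class BalancedBatchSampler(Sampler):
    """Yields index batches with an equal number of examples per class."""

    def __init__(self, labels, batch_size, num_classes=3):
        assert batch_size % num_classes == 0
        self.per_class = batch_size // num_classes
        self.by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_class[label].append(idx)
        # Number of full balanced batches is limited by the rarest class
        self.num_batches = min(len(v) for v in self.by_class.values()) // self.per_class

    def __iter__(self):
        # Reshuffle each class pool at the start of every epoch
        pools = {c: random.sample(v, len(v)) for c, v in self.by_class.items()}
        for b in range(self.num_batches):
            batch = []
            for pool in pools.values():
                batch.extend(pool[b * self.per_class:(b + 1) * self.per_class])
            random.shuffle(batch)
            yield batch

    def __len__(self):
        return self.num_batches

# usage: DataLoader(dataset, batch_sampler=BalancedBatchSampler(labels, batch_size=30))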

📚 Datasets

✅ Used During Training / Distillation

| Dataset | Role |
|---|---|
| SNLI | Base NLI training (entailment, neutral, contradiction) |
| MNLI | Multi-genre generalization (matched + mismatched) |
| ANLI (R1–R3) | Adversarial robustness and hard examples |

🚫 Not Used in Training (Zero-Shot Evaluation Only)

The following datasets were not used during training or distillation. All results on them are pure zero-shot:

| Dataset | Type | Notes |
|---|---|---|
| RTE (GLUE) | Textual entailment | Zero-shot generalization |
| HANS | Heuristic / syntactic bias test | Zero-shot |
| SciTail | Science-domain entailment | Zero-shot, evaluated in binary setting |
| XNLI (English) | Cross-lingual NLI test | Zero-shot on English split |

๐Ÿ— Model Architecture

The model follows a compact transformer architecture:

  • 12 Transformer encoder layers
  • Hidden size: 768
  • 12 attention heads
  • Intermediate feed-forward size as in BERT/DeBERTa-base-style models
  • Final classification head with 3 output logits:
    • 0 = entailment
    • 1 = neutral
    • 2 = contradiction

Total parameters: 184,424,451

The design target is to match or exceed the performance of larger teacher models while remaining efficient enough for real-time inference on a single consumer GPU.
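
As a quick sanity check, the parameter count can be verified after loading the weights; a small sketch that only confirms the total reported above:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "samerzaher80/AetherMind-KD-Student"
)
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")  # expected: 184,424,451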


🔥 Knowledge Distillation Strategy

Objective

The total loss is a weighted combination:

  • Knowledge Distillation Loss (KLDivLoss)
    • Encourages student logits to match the teacher's softened output distribution
  • Supervised Loss (CrossEntropy)
    • Encourages correct prediction of the gold label

Formally:

L_total = 0.7 · L_KD + 0.3 · L_CE

where L_KD uses temperature-scaled teacher logits.
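
A minimal PyTorch sketch of this combined objective; the temperature, weights, and T² scaling follow the description above, while other implementation details of the original pipeline may differ:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: standard cross-entropy on the gold labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce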

Additional Techniques

  • Balanced batches w.r.t. class labels
  • Emphasis on contradiction / neutral examples during later stages
  • Adversarial samples from ANLI to harden reasoning under distribution shifts

📊 Evaluation Results

1๏ธโƒฃ Core NLI Benchmarks

| Dataset | Split | Accuracy | Macro-F1 |
|---|---|---|---|
| MNLI (matched) | validation | 90.47% | 90.42% |
| MNLI (mismatched) | validation | 90.12% | 90.07% |
| SNLI | test | ~88–89% | ~88–89% |

2๏ธโƒฃ Adversarial NLI (ANLI)

| Dataset | Split | Accuracy | Macro-F1 |
|---|---|---|---|
| ANLI R1 | test_r1 | 73.60% | 73.61% |
| ANLI R2 | test_r2 | 57.70% | 57.60% |
| ANLI R3 | test_r3 | 53.67% | 53.68% |

These scores indicate strong robustness, especially considering the model's size.


3๏ธโƒฃ Zero-Shot Generalization

These datasets were never seen during training. All scores are zero-shot.

RTE (GLUE)

  • Accuracy: 86.28%
  • Macro-F1: 86.20%

HANS

  • Accuracy: 77.74%
  • Macro-F1: 76.60%

The strong performance on HANS suggests reduced dependence on shallow lexical heuristics.

SciTail (Binary Setting)

SciTail originally has entailment vs neutral classes. For evaluation, the model's 3-way logits are mapped as follows (a mapping sketch appears after the results table):

  • Entailment → entailment
  • Neutral + contradiction → non-entailment

| Split | Accuracy | Macro-F1 |
|---|---|---|
| Train | 82.37% | 80.99% |
| Dev | 78.83% | 78.81% |
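
A minimal sketch of this binary mapping; the argmax-based collapse is an assumption consistent with the description above:

import torch

def to_binary_prediction(logits):
    # 3-way argmax: 0 = entailment, 1 = neutral, 2 = contradiction
    pred = logits.argmax(dim=-1)
    # Binary label: 0 = entailment, 1 = non-entailment (neutral or contradiction)
    return (pred != 0).long()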

XNLI (English, zero-shot)

  • Accuracy: 90.92%
  • Macro-F1: 90.94%

This demonstrates strong cross-domain and cross-benchmark generalization, even without explicit multilingual or XNLI-specific training.


⚡ Efficiency

| Metric | Value |
|---|---|
| Total parameters | 184,424,451 |
| Inference speed | ≈ 308.51 samples/second |
| Hardware | RTX 3050 (8 GB), CUDA 11.8 |

These numbers make the model a good choice for production environments and large-scale batch inference.
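
Throughput depends on batch size, sequence length, and hardware. A rough timing sketch along these lines can reproduce a samples/second figure on other GPUs; the workload (1,024 sentence pairs, batch size 64) is an illustrative assumption, not the original benchmark setup:

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "samerzaher80/AetherMind-KD-Student"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).to(device).eval()

# Illustrative workload: 1,024 identical premise/hypothesis pairs, batch size 64
pairs = [("A cat is sleeping on the sofa.", "An animal is resting indoors.")] * 1024
batch_size = 64

start = time.perf_counter()
with torch.no_grad():
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        inputs = tokenizer(
            [p for p, _ in batch], [h for _, h in batch],
            padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        model(**inputs)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{len(pairs) / elapsed:.2f} samples/second")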


🧪 Intended Use

Recommended Uses

  • Research on NLI, robustness, and knowledge distillation
  • As a drop-in NLI component for:
    • Scientific text understanding
    • Claim verification prototypes
    • General English reasoning tasks
  • Zero-shot probing on new NLI-style benchmarks

Not Recommended For

  • Safety-critical applications (medical diagnosis, legal decisions, etc.) without human experts in the loop
  • High-stakes multilingual use cases (model is trained and validated on English only)
  • Long-document reasoning beyond typical transformer context length

⚠ Limitations

  • Performance on ANLI R3 remains challenging, consistent with broader model behavior in the literature
  • No dedicated multilingual training (XNLI non-English languages not evaluated)
  • No explicit calibration of probabilities (users may wish to post-calibrate logits/probabilities)

🔮 Future Work

Planned and possible future enhancements include:

  • Adversarial fine-tuning specifically for ANLI R3
  • Cross-lingual extensions using full XNLI
  • Domain adapters for biomedical and clinical NLI (e.g., MedNLI)
  • Integration in larger cognitive reasoning systems with memory and tool-use (outside the scope of this model card)

📦 Files in This Repository

  • config.json – model configuration
  • model.safetensors – model weights
  • tokenizer.json – tokenizer model
  • tokenizer_config.json – tokenizer configuration
  • special_tokens_map.json – special tokens metadata
  • spm.model – SentencePiece model (if applicable)
  • added_tokens.json – additional tokens (if any)
  • training_args.bin – training arguments (optional, for reproducibility)
  • trainer_state.json – trainer state (optional, for reproducibility)

💻 Usage Example

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "samerzaher80/AetherMind-KD-Student"

# Load the tokenizer and the distilled student model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting indoors."

# Encode the (premise, hypothesis) pair and run a forward pass
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the class with the highest logit
logits = outputs.logits
pred = logits.argmax(dim=-1).item()

id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
print(id2label[pred])
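
If class probabilities are needed (for example, for thresholding or for post-hoc calibration, as noted in the Limitations section), a softmax over the logits can be applied; a minimal continuation of the example above:

# Convert logits to (uncalibrated) class probabilities
probs = torch.softmax(logits, dim=-1).squeeze(0)
for idx, p in enumerate(probs.tolist()):
    print(f"{id2label[idx]}: {p:.3f}")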

📜 Citation

If you use this model in your research, please cite:

@misc{aethermind2025kdstudent,
  title        = {AetherMind-KD-Student: A Robust and Efficient Knowledge-Distilled NLI Model},
  author       = {Sameer S. Najm},
  year         = {2025},
  howpublished = {Hugging Face model repository},
  note         = {\url{https://huggingface.co/samerzaher80/AetherMind-KD-Student}}
}

👤 Author

Sameer S. Najm
AI Researcher & Founder, Sam IT Solutions – Iraq


🪪 License

This model is released under the MIT License.
