# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping

## Model Overview
Fine-tuned E5-base model optimized for exact chunk retrieval in Vietnamese mathematics using:
- 🎯 Binary Classification: Correct vs Incorrect (instead of 3-level hierarchy)
- 💪 Hard Negatives: Related chunks as hard negatives for better discrimination
- ⏰ Loss-based Early Stopping: Stops when validation loss stops improving
- 📊 Comprehensive Evaluation: Hit@K, Accuracy@1, MRR metrics
## Performance Summary

### Training Results
- Best Validation Loss: N/A
- Training Epochs: 10
- Early Stopping: ❌ Not triggered
- Training Time: ~4,661 seconds (≈1.3 hours)
### Test Performance 🌟 EXCELLENT

Strong performance, with the correct chunk consistently ranked at or near the top:
| Metric | Base E5 | Fine-tuned | Improvement |
|---|---|---|---|
| MRR | 0.7740 | 0.8505 | +0.0765 |
| Accuracy@1 | 0.6129 | 0.7634 | +0.1505 |
| Hit@1 | 0.6129 | 0.7634 | +0.1505 |
| Hit@3 | 0.9462 | 0.9247 | -0.0215 |
| Hit@5 | 1.0000 | 0.9785 | -0.0215 |
Total Test Queries: 93
## Key Innovations

### 🎯 Binary Classification Approach

Instead of a traditional 3-level relevance hierarchy (correct/related/irrelevant), this model uses:
- Correct chunks: Score 1.0 (positive examples)
- Incorrect chunks: Score 0.0 (includes both related and irrelevant)
- Hard negatives: Related chunks serve as challenging negative examples
### 💪 Hard Negatives Strategy

```python
# Training strategy: collapse the 3-level hierarchy into binary labels
positive = correct_chunk          # score 1.0
hard_negative = related_chunk     # score 0.0, but semantically close
easy_negative = irrelevant_chunk  # score 0.0, semantically distant
# This forces the model to learn fine-grained distinctions
```
### ⏰ Loss-based Early Stopping
- Monitors validation loss instead of MRR
- Stops when loss stops decreasing (patience=3)
- Prevents overfitting and saves training time
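The stopping rule described above can be sketched as a small helper (illustrative only; this is not the exact training callback used for this model):

```python
class LossEarlyStopper:
    """Stop training when validation loss fails to improve `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Loss improved: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With `patience=3`, training runs until three consecutive epochs fail to beat the best validation loss seen so far.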
## Usage

### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer('ThanhLe0125/ebd-math')

# ⚠️ CRITICAL: E5 prefixes ("query: ", "passage: ") are required
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0",       # irrelevant
]

# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]

# Print chunks from most to least similar
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```
### Advanced Usage with Multiple Queries
```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find the best chunks for multiple queries."""
    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = query if query.startswith("query:") else f"query: {query}"
        formatted_chunks = [
            chunk if chunk.startswith("passage:") else f"passage: {chunk}"
            for chunk in chunk_pool
        ]

        # Encode
        query_emb = model.encode([formatted_query])
        chunk_embs = model.encode(formatted_chunks)
        similarities = cosine_similarity(query_emb, chunk_embs)[0]

        # Keep the top K chunks
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': similarities[i],
                'rank': rank + 1,
            }
            for rank, i in enumerate(top_indices)
        ]
        results.append({'query': query, 'top_chunks': top_chunks})
    return results

# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai",
    "Định nghĩa giới hạn của hàm số",
]
chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]
results = find_best_chunks(queries, chunk_pool, top_k=3)
```
## Training Details

### Dataset
- Domain: Vietnamese mathematics education
- Split: Train/Validation/Test with proper separation
- Hard Negatives: Related mathematical concepts as challenging negatives
- Easy Negatives: Unrelated mathematical concepts
### Training Configuration

```python
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
```
### Evaluation Methodology
- Training: Binary classification with hard negatives
- Validation: Loss-based monitoring for early stopping
- Testing: Comprehensive evaluation with restored 3-level hierarchy
- Metrics: Hit@K, Accuracy@1, MRR comparison vs base model
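The reported metrics can be computed from per-query ranked chunk lists as sketched below (illustrative helper; it assumes each query has exactly one correct chunk that appears somewhere in the ranking, as in this card's test set):

```python
def ranking_metrics(ranked_ids, correct_ids, ks=(1, 3, 5)):
    """Compute Hit@K, Accuracy@1, and MRR.

    ranked_ids: one ranked list of chunk ids per query (best first).
    correct_ids: the single correct chunk id for each query.
    """
    n = len(ranked_ids)
    hits = {k: 0 for k in ks}
    rr_sum = 0.0
    for ranking, correct in zip(ranked_ids, correct_ids):
        rank = ranking.index(correct) + 1  # 1-based rank of the correct chunk
        rr_sum += 1.0 / rank               # reciprocal rank for MRR
        for k in ks:
            if rank <= k:
                hits[k] += 1
    metrics = {f"Hit@{k}": hits[k] / n for k in ks}
    metrics["MRR"] = rr_sum / n
    metrics["Accuracy@1"] = metrics["Hit@1"]  # identical with one correct chunk per query
    return metrics
```

Note that with a single correct chunk per query, Accuracy@1 and Hit@1 coincide, which is why the table above reports the same value for both.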
## Model Architecture
- Base: intfloat/multilingual-e5-base
- Max Sequence Length: 256 tokens
- Output Dimension: 768
- Similarity: Cosine similarity
- Training Loss: MultipleNegativesRankingLoss
## Use Cases
- ✅ Educational Q&A: Find exact mathematical definitions and explanations
- ✅ Content Retrieval: Precise chunk retrieval for Vietnamese math content
- ✅ Tutoring Systems: Quick and accurate answer finding
- ✅ Knowledge Base Search: Efficient mathematical concept lookup
## Performance Interpretation
- Hit@1 ≥ 0.7: 🌟 Excellent - Correct answer usually at #1
- Hit@3 ≥ 0.8: 🎯 Very Good - Correct answer in top 3
- MRR ≥ 0.7: 👍 Good - Low average rank for correct answers
- Accuracy@1 ≥ 0.6: ✅ Solid - Good precision for top result
## Limitations
- Vietnamese-specific: Optimized for Vietnamese mathematical terminology
- Domain-specific: Best performance on educational math content
- Sequence length: Limited to 256 tokens
- E5 format required: Must use "query:" and "passage:" prefixes
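The prefix requirement can be enforced with a tiny helper (illustrative; `with_e5_prefix` is not part of the model's API):

```python
def with_e5_prefix(text, kind="query"):
    """Prepend the E5 'query: ' or 'passage: ' prefix if it is missing."""
    prefix = f"{kind}: "
    return text if text.startswith(prefix) else prefix + text
```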
## Citation

```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```
Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.