E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping

Model Overview

Fine-tuned E5-base model optimized for exact chunk retrieval in Vietnamese mathematics using:

  • 🎯 Binary Classification: Correct vs Incorrect (instead of 3-level hierarchy)
  • 💪 Hard Negatives: Related chunks as hard negatives for better discrimination
  • ⏰ Loss-based Early Stopping: Stops when validation loss stops improving
  • 📊 Comprehensive Evaluation: Hit@K, Accuracy@1, MRR metrics

Performance Summary

Training Results

  • Best Validation Loss: N/A
  • Training Epochs: 10
  • Early Stopping: ❌ Not triggered
  • Training Time: 4661.23 seconds (~78 minutes)

Test Performance 🌟 EXCELLENT

The fine-tuned model places the correct chunk at rank 1 far more often than base E5 (Hit@1 +0.1505), with a slight trade-off at Hit@3 and Hit@5

Metric        Base E5    Fine-tuned   Improvement
MRR           0.7740     0.8505       +0.0765
Accuracy@1    0.6129     0.7634       +0.1505
Hit@1         0.6129     0.7634       +0.1505
Hit@3         0.9462     0.9247       -0.0215
Hit@5         1.0000     0.9785       -0.0215

Total Test Queries: 93

Key Innovations

🎯 Binary Classification Approach

Instead of traditional 3-level hierarchy (correct/related/irrelevant), this model uses:

  • Correct chunks: Score 1.0 (positive examples)
  • Incorrect chunks: Score 0.0 (includes both related and irrelevant)
  • Hard negatives: Related chunks serve as challenging negative examples

💪 Hard Negatives Strategy

# Training strategy
positive = correct_chunk           # Score: 1.0
hard_negative = related_chunk      # Score: 0.0 (but semantically close)
easy_negative = irrelevant_chunk   # Score: 0.0 (semantically distant)

# This forces model to learn fine-grained distinctions
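The strategy above can be made concrete as a pair-construction sketch. The record fields (`query`, `correct`, `related`, `irrelevant`) are assumed for illustration, not taken from the actual training script:

```python
# Illustrative sketch: turning labeled records into binary training pairs.
# The field names below are assumptions for this example, not the real schema.
def build_training_pairs(records):
    pairs = []
    for rec in records:
        query = f"query: {rec['query']}"
        pairs.append((query, f"passage: {rec['correct']}", 1.0))     # positive
        pairs.append((query, f"passage: {rec['related']}", 0.0))     # hard negative
        pairs.append((query, f"passage: {rec['irrelevant']}", 0.0))  # easy negative
    return pairs

example = [{
    'query': 'Định nghĩa hàm số đồng biến là gì?',
    'correct': 'Hàm số đồng biến trên khoảng (a;b) là hàm số mà...',
    'related': 'Ví dụ về hàm số đồng biến: f(x) = 2x + 1...',
    'irrelevant': 'Phương trình bậc hai có dạng ax² + bx + c = 0',
}]
pairs = build_training_pairs(example)  # 3 pairs: 1 positive, 2 negatives
```

Each query thus contributes one positive and two negatives, with the related chunk providing the fine-grained contrast the model must learn.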

⏰ Loss-based Early Stopping

  • Monitors validation loss instead of MRR
  • Stops when loss stops decreasing (patience=3)
  • Prevents overfitting and saves training time
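The stopping rule can be sketched in a few lines. This illustrates the criterion only and is not the actual training loop:

```python
# Minimal sketch of loss-based early stopping with patience=3:
# stop once the best validation loss has not improved for `patience` epochs.
def should_stop(loss_history, patience=3):
    if len(loss_history) <= patience:
        return False
    best_so_far = min(loss_history[:-patience])
    return min(loss_history[-patience:]) >= best_so_far

# Loss bottoms out at epoch 3, then fails to improve for 3 epochs -> stop
stop = should_stop([0.9, 0.8, 0.7, 0.71, 0.72, 0.73])  # True
```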

Usage

Basic Usage

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')

# ⚠️ CRITICAL: Must use E5 prefixes
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # Should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # Related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # Irrelevant
]

# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]

# Get rankings
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")

Advanced Usage with Multiple Queries

def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find the top-k chunks for each query."""
    # Chunk embeddings do not depend on the query, so apply the E5
    # "passage:" prefix and encode the pool once up front.
    formatted_chunks = [chunk if chunk.startswith("passage:") else f"passage: {chunk}"
                        for chunk in chunk_pool]
    chunk_embs = model.encode(formatted_chunks)

    results = []
    for query in queries:
        # Ensure E5 "query:" prefix
        formatted_query = query if query.startswith("query:") else f"query: {query}"
        query_emb = model.encode([formatted_query])
        similarities = cosine_similarity(query_emb, chunk_embs)[0]

        # Rank by descending similarity and keep the top k
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': float(similarities[i]),
                'rank': rank
            }
            for rank, i in enumerate(top_indices, 1)
        ]
        results.append({
            'query': query,
            'top_chunks': top_chunks
        })

    return results

# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai", 
    "Định nghĩa giới hạn của hàm số"
]

chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]

results = find_best_chunks(queries, chunk_pool, top_k=3)

Training Details

Dataset

  • Domain: Vietnamese mathematics education
  • Split: Train/Validation/Test with proper separation
  • Hard Negatives: Related mathematical concepts as challenging negatives
  • Easy Negatives: Unrelated mathematical concepts

Training Configuration

Config:
    base_model = "intfloat/multilingual-e5-base"
    train_batch_size = 4
    learning_rate = 2e-5
    max_epochs = 10
    early_stopping_patience = 3
    loss_function = "MultipleNegativesRankingLoss"
    evaluation_metric = "validation_loss"
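The objective named above (MultipleNegativesRankingLoss) treats every other chunk in a batch, including the explicit hard negatives, as a negative for each query. A pure-Python sketch of the idea, not the sentence-transformers implementation:

```python
import math

# Sketch of the in-batch negatives objective: sim_matrix[i][j] is the
# similarity between query i and chunk j, and chunk i is query i's positive.
# The loss is the mean negative log-softmax of each positive's score.
def mnr_loss(sim_matrix):
    total = 0.0
    for i, row in enumerate(sim_matrix):
        log_denom = math.log(sum(math.exp(s) for s in row))
        total += log_denom - row[i]  # -log softmax at the positive
    return total / len(sim_matrix)

# When hard negatives score almost as high as positives, the loss is larger,
# which is exactly the pressure that teaches fine-grained distinctions.
easy_batch = [[5.0, 0.0], [0.0, 5.0]]
hard_batch = [[5.0, 4.9], [4.9, 5.0]]
```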

Evaluation Methodology

  1. Training: Binary classification with hard negatives
  2. Validation: Loss-based monitoring for early stopping
  3. Testing: Comprehensive evaluation with restored 3-level hierarchy
  4. Metrics: Hit@K, Accuracy@1, MRR comparison vs base model
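The metrics in step 4 can be computed from the rank of the correct chunk for each test query. A minimal sketch, assuming one correct chunk per query (in which case Accuracy@1 coincides with Hit@1):

```python
# `rankings` holds, for each test query, the 1-based rank of the correct
# chunk in the retrieved list.
def evaluate(rankings, ks=(1, 3, 5)):
    n = len(rankings)
    metrics = {f"Hit@{k}": sum(r <= k for r in rankings) / n for k in ks}
    metrics["MRR"] = sum(1.0 / r for r in rankings) / n
    metrics["Accuracy@1"] = metrics["Hit@1"]  # identical with one correct chunk
    return metrics

m = evaluate([1, 2, 1, 5])
# Hit@1 = 0.5, Hit@3 = 0.75, Hit@5 = 1.0, MRR = 0.675
```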

Model Architecture

  • Base: intfloat/multilingual-e5-base
  • Max Sequence Length: 256 tokens
  • Output Dimension: 768
  • Similarity: Cosine similarity
  • Training Loss: MultipleNegativesRankingLoss

Use Cases

  • Educational Q&A: Find exact mathematical definitions and explanations
  • Content Retrieval: Precise chunk retrieval for Vietnamese math content
  • Tutoring Systems: Quick and accurate answer finding
  • Knowledge Base Search: Efficient mathematical concept lookup

Performance Interpretation

  • Hit@1 ≥ 0.7: 🌟 Excellent - Correct answer usually at #1
  • Hit@3 ≥ 0.8: 🎯 Very Good - Correct answer in top 3
  • MRR ≥ 0.7: 👍 Good - Low average rank for correct answers
  • Accuracy@1 ≥ 0.6: ✅ Solid - Good precision for top result

Limitations

  • Vietnamese-specific: Optimized for Vietnamese mathematical terminology
  • Domain-specific: Best performance on educational math content
  • Sequence length: Limited to 256 tokens
  • E5 format required: Must use "query:" and "passage:" prefixes

Citation

@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}

Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.
