DeAR-3B-Reranker-CE-v1

Model Description

DeAR-3B-Reranker-CE-v1 is an efficient 3B-parameter neural reranker trained with binary cross-entropy (BCE) loss and knowledge distillation. The model provides fast, reliable reranking for production environments where speed and efficiency are critical.

Model Details

  • Model Type: Pointwise Reranker (Binary Classification)
  • Base Model: LLaMA-3.2-3B
  • Parameters: 3 billion
  • Training Method: Knowledge Distillation + Binary Cross-Entropy
  • Teacher Model: LLaMA2-13B-RankLLaMA
  • Training Data: MS MARCO + DeAR-COT
  • Model Size: 6GB (BF16)

Key Features

✅ Ultra Fast: ~1.5s to rerank 100 documents (fastest in the DeAR family)
✅ Memory Efficient: Runs on a single 16GB GPU
✅ Production Ready: Stable training with BCE loss
✅ Cost Effective: Lower computational costs
✅ Binary Classification: Probabilistic relevance scores

Usage

Quick Start

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model
model_path = "abdoelsayed/dear-3b-reranker-ce-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
model.eval().cuda()

# Score a query-document pair
query = "What is llama?"
document = "The llama is a domesticated South American camelid..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
    
print(f"Relevance score: {score}")

Efficient Batch Processing

import torch
from typing import List, Tuple

@torch.inference_mode()
def fast_rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size: int = 128):
    """Fast reranking optimized for 3B model."""
    device = next(model.parameters()).device
    scores = []
    
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        
        # Prepare batch
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {t} {p}" for t, p in batch]
        
        # Tokenize
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Score
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    
    # Rank
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)


# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul National University"),
]

ranking = fast_rerank(tokenizer, model, query, docs, batch_size=128)
print(ranking)
# Example output from DeAR-3B-Reranker-CE-v1:
# [(0, -6.0625), (2, -11.125), (1, -12.0625)]
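
The helper returns (original_index, score) pairs sorted by descending score. If you need the reranked passages themselves, you can map the indices back through the original docs list, as in this small sketch:

# Recover the reranked passages from the (index, score) pairs.
reranked_docs = [(docs[i][1], s) for i, s in ranking]
for text, s in reranked_docs:
    print(f"{s:8.3f}  {text}")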

Production Optimization

# Optimize for maximum throughput
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "abdoelsayed/dear-3b-reranker-ce-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

# Compile for 20-30% speedup (PyTorch 2.0+)
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode="max-autotune")

# Use larger batches for throughput
batch_size = 128  # 3B can handle larger batches
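
To sanity-check the latency figures on your own hardware, you can time fast_rerank (defined above) over a synthetic workload. This is a rough sketch under stated assumptions: the dummy passages are placeholders, a warm-up pass is included so compilation and CUDA initialization are not counted, and absolute numbers will vary with GPU, batch size, and torch.compile settings.

import time

# Hypothetical workload: one query against 100 synthetic passages.
query = "what is knowledge distillation"
docs = [("", f"passage {i} about knowledge distillation and reranking") for i in range(100)]

# Warm-up pass (compilation, CUDA init) before timing.
fast_rerank(tokenizer, model, query, docs, batch_size=batch_size)

torch.cuda.synchronize()
start = time.perf_counter()
fast_rerank(tokenizer, model, query, docs, batch_size=batch_size)
torch.cuda.synchronize()
print(f"Reranked {len(docs)} docs in {time.perf_counter() - start:.2f}s")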

Training Details

Training Configuration

{
    "base_model": "meta-llama/Llama-3.2-3B",
    "teacher_model": "abdoelsayed/llama2-13b-rankllama-teacher",
    "loss": "Binary Cross-Entropy",
    "distillation": {
        "temperature": 2.0,
        "alpha": 0.1
    },
    "learning_rate": 1e-4,
    "batch_size": 4,
    "gradient_accumulation": 2,
    "epochs": 2,
    "max_length": 228,
    "bf16": true
}
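
The configuration combines a hard-label BCE term with a distillation term against the teacher's scores, softened by the temperature and weighted by alpha. The exact formulation is given in the DeAR paper; the sketch below shows one plausible way to combine the two terms under these hyperparameters, and the sigmoid-based soft targets and (1 - alpha)/alpha weighting are assumptions rather than the confirmed recipe.

import torch
import torch.nn.functional as F

def distill_bce_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.1):
    """Hypothetical BCE + distillation objective matching the config above.

    student_logits, teacher_logits: (batch,) relevance scores for query-document pairs
    labels: (batch,) binary relevance labels in {0, 1}
    """
    # Hard-label term: standard binary cross-entropy on the student's logits.
    bce = F.binary_cross_entropy_with_logits(student_logits, labels.float())

    # Soft-label term: match the teacher's temperature-softened relevance probabilities.
    teacher_probs = torch.sigmoid(teacher_logits / temperature)
    student_probs = torch.sigmoid(student_logits / temperature)
    kd = F.binary_cross_entropy(student_probs, teacher_probs)

    # alpha weights the distillation signal against the hard BCE objective.
    return (1 - alpha) * bce + alpha * kd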

Hardware

  • GPUs: 4x NVIDIA A100 (40GB)
  • Training Time: ~17 hours
  • Memory Usage: ~24GB per GPU
  • Trainable Parameters: 3B

Evaluation Results

TREC Deep Learning

Dataset   NDCG@10   NDCG@20   MRR@10
DL19      70.8      67.3      83.9
DL20      68.9      65.8      81.7

BEIR Benchmark

Dataset      NDCG@10
MS MARCO     65.3
NQ           48.7
HotpotQA     57.9
FiQA         43.6
ArguAna      55.8
SciFact      70.2
TREC-COVID   81.8
NFCorpus     37.2
Average      41.7

Efficiency

Metric                 3B-CE      8B-CE      Improvement
Inference (100 docs)   1.5s       2.2s       1.5x faster
Throughput             67 docs/s  45 docs/s  1.5x
GPU Memory             12GB       18GB       33% less
Model Size             6GB        16GB       62% smaller

Comparison

vs. Other 3B Models

Model            Loss     DL19   DL20   Speed (s)
DeAR-3B-CE       BCE      70.8   68.9   1.5
DeAR-3B-RankNet  RankNet  71.2   69.4   1.5
MonoT5-3B        -        71.8   68.9   3.5

Key Advantages:

  • 2.3x faster than MonoT5-3B
  • Comparable accuracy
  • More stable training (BCE vs complex losses)

When to Use

Best for:

  • ✅ High-throughput production systems
  • ✅ Real-time applications (latency <2s)
  • ✅ Cost-sensitive deployments
  • ✅ Edge deployment (smaller GPUs)
  • ✅ Binary relevance tasks

Consider alternatives for:

  • ❌ Maximum accuracy (use 8B models)
  • ❌ Complex reasoning queries (use listwise)
  • ❌ Unlimited compute budget

Deployment Examples

REST API Server

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reuses the fast_rerank helper defined in the batch-processing example above.

app = FastAPI()

# Load model once at startup
tokenizer, model = None, None

@app.on_event("startup")
async def load_model():
    global tokenizer, model
    tokenizer = AutoTokenizer.from_pretrained("abdoelsayed/dear-3b-reranker-ce-v1")
    model = AutoModelForSequenceClassification.from_pretrained(
        "abdoelsayed/dear-3b-reranker-ce-v1",
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    model.eval()
    if hasattr(torch, 'compile'):
        model = torch.compile(model)

class RerankRequest(BaseModel):
    query: str
    documents: List[str]

@app.post("/rerank")
async def rerank(request: RerankRequest):
    # Documents arrive as plain strings, so pass empty titles alongside each text.
    ranking = fast_rerank(tokenizer, model, request.query,
                          [("", doc) for doc in request.documents])
    return {"ranking": ranking}

Batch Processing Script

import pandas as pd
from tqdm import tqdm

# Load queries and documents.
# Assumes the 'documents' column holds a list of (title, text) pairs per row,
# matching the input format expected by fast_rerank above.
df = pd.read_csv("queries_docs.csv")

results = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    ranking = fast_rerank(tokenizer, model, row['query'], row['documents'])
    results.append({
        'query_id': row['query_id'],
        'ranking': ranking
    })

# Save results
pd.DataFrame(results).to_csv("reranked.csv")
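
If you plan to score the output with trec_eval or pytrec_eval against judgments such as the TREC DL sets above, the same results can also be written as a standard six-column TREC run file. This is only a sketch: the positional doc ids and the dear-3b-ce run tag are placeholders, so substitute your own collection ids.

# TREC run format: query_id Q0 doc_id rank score run_tag
with open("reranked.run", "w") as f:
    for r in results:
        for rank, (doc_idx, score) in enumerate(r['ranking'], start=1):
            # Placeholder doc id derived from the document's position in the input list.
            f.write(f"{r['query_id']} Q0 doc{doc_idx} {rank} {score:.4f} dear-3b-ce\n")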

Model Architecture

Input: "query: [Q] [SEP] document: [D]"
    ↓
LLaMA-3.2-3B (28 layers, 3072 hidden)
    ↓
[CLS] Token Pooling
    ↓
Linear(3072 → 1)
    ↓
Binary Relevance Score
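
A quick way to confirm the head described above is to inspect the checkpoint's config: it should report a hidden size of 3072 and a single output label, i.e. the scalar relevance logit used throughout the examples. The attribute names below are standard transformers config fields, not anything DeAR-specific.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("abdoelsayed/dear-3b-reranker-ce-v1")
# Expect hidden_size == 3072 and num_labels == 1 (one relevance logit per pair).
print(config.hidden_size, config.num_hidden_layers, config.num_labels)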

Limitations

  1. Accuracy: ~3-4 NDCG@10 lower than 8B models
  2. Complex Queries: May miss subtle nuances
  3. Document Length: Limited to 196 tokens
  4. Language: English only
  5. Domain: Optimized for web documents

Related Models

DeAR 3B Family:

Larger Models:

Resources:

Citation

@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}

License

MIT License

More Information
