DeAR-3B-Reranker-CE-v1

Model Description

DeAR-3B-Reranker-CE-v1 is an efficient 3B-parameter neural reranker trained with binary cross-entropy (BCE) loss and knowledge distillation. The model provides fast, reliable reranking for production environments where speed and efficiency are critical.

Model Details

  • Model Type: Pointwise Reranker (Binary Classification)
  • Base Model: LLaMA-3.2-3B
  • Parameters: 3 billion
  • Training Method: Knowledge Distillation + Binary Cross-Entropy
  • Teacher Model: LLaMA2-13B-RankLLaMA
  • Training Data: MS MARCO + DeAR-COT
  • Model Size: 6GB (BF16)

Key Features

✅ Ultra Fast: ~1.5s to rerank 100 documents (fastest in the DeAR family)
✅ Memory Efficient: Runs on a single 16GB GPU
✅ Production Ready: Stable training with BCE loss
✅ Cost Effective: Lower computational costs
✅ Binary Classification: Probabilistic relevance scores

Usage

Quick Start

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model
model_path = "abdoelsayed/dear-3b-reranker-ce-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
model.eval().cuda()

# Score a query-document pair
query = "What is llama?"
document = "The llama is a domesticated South American camelid..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
    
print(f"Relevance score: {score}")

Efficient Batch Processing

import torch
from typing import List, Tuple

@torch.inference_mode()
def fast_rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size: int = 128):
    """Fast reranking optimized for 3B model."""
    device = next(model.parameters()).device
    scores = []
    
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        
        # Prepare batch
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {t} {p}" for t, p in batch]
        
        # Tokenize
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Score
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    
    # Rank
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)


# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul National University"),
]

ranking = fast_rerank(tokenizer, model, query, docs, batch_size=128)
print(ranking)
# Example output from DeAR-3B-Reranker-CE-v1:
# [(0, -6.0625), (2, -11.125), (1, -12.0625)]
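
The helper returns (original_index, score) pairs sorted by descending score. If you need the reranked passages themselves, you can map the indices back through the original docs list, as in this small sketch:

# Recover the reranked passages from the (index, score) pairs.
reranked_docs = [(docs[i][1], s) for i, s in ranking]
for text, s in reranked_docs:
    print(f"{s:8.3f}  {text}")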

Production Optimization

# Optimize for maximum throughput
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "abdoelsayed/dear-3b-reranker-ce-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

# Compile for 20-30% speedup (PyTorch 2.0+)
if hasattr(torch, 'compile'):
    model = torch.compile(model, mode="max-autotune")

# Use larger batches for throughput
batch_size = 128  # 3B can handle larger batches
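
To sanity-check the latency figures on your own hardware, you can time fast_rerank (defined above) over a synthetic workload. This is a rough sketch under stated assumptions: the dummy passages are placeholders, a warm-up pass is included so compilation and CUDA initialization are not counted, and absolute numbers will vary with GPU, batch size, and torch.compile settings.

import time

# Hypothetical workload: one query against 100 synthetic passages.
query = "what is knowledge distillation"
docs = [("", f"passage {i} about knowledge distillation and reranking") for i in range(100)]

# Warm-up pass (compilation, CUDA init) before timing.
fast_rerank(tokenizer, model, query, docs, batch_size=batch_size)

torch.cuda.synchronize()
start = time.perf_counter()
fast_rerank(tokenizer, model, query, docs, batch_size=batch_size)
torch.cuda.synchronize()
print(f"Reranked {len(docs)} docs in {time.perf_counter() - start:.2f}s")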

Training Details

Training Configuration

{
    "base_model": "meta-llama/Llama-3.2-3B",
    "teacher_model": "abdoelsayed/llama2-13b-rankllama-teacher",
    "loss": "Binary Cross-Entropy",
    "distillation": {
        "temperature": 2.0,
        "alpha": 0.1
    },
    "learning_rate": 1e-4,
    "batch_size": 4,
    "gradient_accumulation": 2,
    "epochs": 2,
    "max_length": 228,
    "bf16": true
}
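
The configuration combines a hard-label BCE term with a distillation term against the teacher's scores, softened by the temperature and weighted by alpha. The exact formulation is given in the DeAR paper; the sketch below shows one plausible way to combine the two terms under these hyperparameters, and the sigmoid-based soft targets and (1 - alpha)/alpha weighting are assumptions rather than the confirmed recipe.

import torch
import torch.nn.functional as F

def distill_bce_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.1):
    """Hypothetical BCE + distillation objective matching the config above.

    student_logits, teacher_logits: (batch,) relevance scores for query-document pairs
    labels: (batch,) binary relevance labels in {0, 1}
    """
    # Hard-label term: standard binary cross-entropy on the student's logits.
    bce = F.binary_cross_entropy_with_logits(student_logits, labels.float())

    # Soft-label term: match the teacher's temperature-softened relevance probabilities.
    teacher_probs = torch.sigmoid(teacher_logits / temperature)
    student_probs = torch.sigmoid(student_logits / temperature)
    kd = F.binary_cross_entropy(student_probs, teacher_probs)

    # alpha weights the distillation signal against the hard BCE objective.
    return (1 - alpha) * bce + alpha * kd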

Hardware

  • GPUs: 4x NVIDIA A100 (40GB)
  • Training Time: ~17 hours
  • Memory Usage: ~24GB per GPU
  • Trainable Parameters: 3B

Evaluation Results

TREC Deep Learning

Dataset   NDCG@10   NDCG@20   MRR@10
DL19      70.8      67.3      83.9
DL20      68.9      65.8      81.7

BEIR Benchmark

Dataset      NDCG@10
MS MARCO     65.3
NQ           48.7
HotpotQA     57.9
FiQA         43.6
ArguAna      55.8
SciFact      70.2
TREC-COVID   81.8
NFCorpus     37.2
Average      41.7

Efficiency

Metric                 3B-CE      8B-CE      Improvement
Inference (100 docs)   1.5s       2.2s       1.5x faster
Throughput             67 docs/s  45 docs/s  1.5x
GPU Memory             12GB       18GB       33% less
Model Size             6GB        16GB       62% smaller

Comparison

vs. Other 3B Models

Model            Loss     DL19   DL20   Speed (s)
DeAR-3B-CE       BCE      70.8   68.9   1.5
DeAR-3B-RankNet  RankNet  71.2   69.4   1.5
MonoT5-3B        -        71.8   68.9   3.5

Key Advantages:

  • 2.3x faster than MonoT5-3B
  • Comparable accuracy
  • More stable training (BCE vs complex losses)

When to Use

Best for:

  • ✅ High-throughput production systems
  • ✅ Real-time applications (latency <2s)
  • ✅ Cost-sensitive deployments
  • ✅ Edge deployment (smaller GPUs)
  • ✅ Binary relevance tasks

Consider alternatives for:

  • ❌ Maximum accuracy (use 8B models)
  • ❌ Complex reasoning queries (use listwise)
  • ❌ Unlimited compute budget

Deployment Examples

REST API Server

from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Reuses the fast_rerank helper defined in the batch-processing example above.

app = FastAPI()

# Load model once at startup
tokenizer, model = None, None

@app.on_event("startup")
async def load_model():
    global tokenizer, model
    tokenizer = AutoTokenizer.from_pretrained("abdoelsayed/dear-3b-reranker-ce-v1")
    model = AutoModelForSequenceClassification.from_pretrained(
        "abdoelsayed/dear-3b-reranker-ce-v1",
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    model.eval()
    if hasattr(torch, 'compile'):
        model = torch.compile(model)

class RerankRequest(BaseModel):
    query: str
    documents: List[str]

@app.post("/rerank")
async def rerank(request: RerankRequest):
    # Documents arrive as plain strings, so pass empty titles alongside each text.
    ranking = fast_rerank(tokenizer, model, request.query,
                          [("", doc) for doc in request.documents])
    return {"ranking": ranking}

Batch Processing Script

import pandas as pd
from tqdm import tqdm

# Load queries and documents.
# Assumes the 'documents' column holds a list of (title, text) pairs per row,
# matching the input format expected by fast_rerank above.
df = pd.read_csv("queries_docs.csv")

results = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    ranking = fast_rerank(tokenizer, model, row['query'], row['documents'])
    results.append({
        'query_id': row['query_id'],
        'ranking': ranking
    })

# Save results
pd.DataFrame(results).to_csv("reranked.csv")
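
If you plan to score the output with trec_eval or pytrec_eval against judgments such as the TREC DL sets above, the same results can also be written as a standard six-column TREC run file. This is only a sketch: the positional doc ids and the dear-3b-ce run tag are placeholders, so substitute your own collection ids.

# TREC run format: query_id Q0 doc_id rank score run_tag
with open("reranked.run", "w") as f:
    for r in results:
        for rank, (doc_idx, score) in enumerate(r['ranking'], start=1):
            # Placeholder doc id derived from the document's position in the input list.
            f.write(f"{r['query_id']} Q0 doc{doc_idx} {rank} {score:.4f} dear-3b-ce\n")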

Model Architecture

Input: "query: [Q] [SEP] document: [D]"
    ↓
LLaMA-3.2-3B (28 layers, 3072 hidden)
    ↓
[CLS] Token Pooling
    ↓
Linear(3072 → 1)
    ↓
Binary Relevance Score
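
A quick way to confirm the head described above is to inspect the checkpoint's config: it should report a hidden size of 3072 and a single output label, i.e. the scalar relevance logit used throughout the examples. The attribute names below are standard transformers config fields, not anything DeAR-specific.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("abdoelsayed/dear-3b-reranker-ce-v1")
# Expect hidden_size == 3072 and num_labels == 1 (one relevance logit per pair).
print(config.hidden_size, config.num_hidden_layers, config.num_labels)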

Limitations

  1. Accuracy: ~3-4 NDCG@10 lower than 8B models
  2. Complex Queries: May miss subtle nuances
  3. Document Length: Limited to 196 tokens
  4. Language: English only
  5. Domain: Optimized for web documents

Related Models

DeAR 3B Family:

Larger Models:

Resources:

Citation

@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}

License

MIT License

More Information
