DeAR-8B-Reranker-RankNet-LoRA-v1

Model Description

DeAR-8B-Reranker-RankNet-LoRA-v1 is a LoRA (Low-Rank Adaptation) adapter for neural reranking. Applied on top of LLaMA-3.1-8B, it yields a pointwise reranker with minimal storage overhead: the adapter weighs only ~100MB yet achieves performance comparable to the fully fine-tuned model.

Model Details

  • Model Type: LoRA Adapter for Pointwise Reranking
  • Base Model: meta-llama/Llama-3.1-8B
  • Adapter Size: ~100MB (vs 16GB for full model)
  • Training Method: LoRA with RankNet Loss + Knowledge Distillation
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Target Modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj
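
The values above are stored in the adapter's adapter_config.json and can be inspected directly with PEFT; a minimal sketch, assuming peft is installed:

from peft import PeftConfig

# Load the LoRA configuration shipped with the adapter
config = PeftConfig.from_pretrained("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")
print(config.base_model_name_or_path)  # meta-llama/Llama-3.1-8B
print(config.r, config.lora_alpha)     # 16 32
print(config.target_modules)           # q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj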

Key Features

✅ Lightweight: ~100MB vs ~16GB for the full model
✅ Efficient Training: Trains about 3x faster than full fine-tuning
✅ Easy Deployment: Just load the adapter on top of the base model
✅ Comparable Performance: ~98% of the full model's performance
✅ Memory Efficient: Lower GPU memory use during training

Usage

Option 1: Load with PEFT (Recommended)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"

# Get base model from adapter config
config = PeftConfig.from_pretrained(adapter_path)
base_model_name = config.base_model_name_or_path

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16
)
base_model.config.pad_token_id = tokenizer.pad_token_id  # needed so padded inputs are scored correctly

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # Merge adapter into base model

model.eval().cuda()

# Use the model
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
    
print(f"Relevance score: {score}")

Option 2: Use Helper Function

import torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

def load_lora_ranker(adapter_path: str, device: str = "cuda"):
    """Load LoRA adapter and merge with base model."""
    # Get base model path from adapter config
    peft_config = PeftConfig.from_pretrained(adapter_path)
    base_model_name = peft_config.base_model_name_or_path
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "right"
    
    # Load base model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name,
        num_labels=1,
        torch_dtype=torch.bfloat16
    )
    base_model.config.pad_token_id = tokenizer.pad_token_id  # required for batched (padded) inputs
    
    # Load LoRA adapter and merge
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model = model.merge_and_unload()
    
    model.eval().to(device)
    return tokenizer, model

# Load model
tokenizer, model = load_lora_ranker("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")

# Rerank documents
@torch.inference_mode()
def rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size: int = 64):
    """Rerank documents for a query."""
    device = next(model.parameters()).device
    scores = []
    
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {title} {text}" for title, text in batch]
        
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)

# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul"),
]

ranking = rerank(tokenizer, model, query, docs)
print(ranking)  # e.g., [(0, 5.2), (2, -3.1), (1, -4.8)], best match first

Using Without Merging (Memory Efficient)

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load adapter (without merging)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# Use model (adapter layers will be applied automatically)
# ... same inference code as above ...
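
For completeness, here is a hedged sketch of the scoring step with the unmerged adapter; the call is the same as in Option 1, and the adapter layers are applied on the fly:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id  # needed for padded inputs

query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."
inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228
)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"Relevance score: {score}")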

Performance

Benchmark     LoRA    Full Model    Difference
TREC DL19     74.2    74.5          -0.3
TREC DL20     72.5    72.8          -0.3
BEIR (Avg)    44.9    45.2          -0.3
MS MARCO      68.6    68.9          -0.3

✅ 98% of full model performance with only 0.6% of the storage!

Training Details

LoRA Configuration

lora_config = {
    "r": 16,  # LoRA rank
    "lora_alpha": 32,  # Scaling factor
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "SEQ_CLS"
}
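
This dictionary maps directly onto a peft.LoraConfig; a minimal sketch, reusing the lora_config dict above:

from peft import LoraConfig

# task_type accepts either the string "SEQ_CLS" or peft.TaskType.SEQ_CLS
peft_lora_config = LoraConfig(**lora_config)
print(peft_lora_config.r, peft_lora_config.lora_alpha)  # 16 32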

Training Hyperparameters

training_args = {
    "learning_rate": 1e-4,  # Higher than full fine-tuning
    "batch_size": 4,  # Larger batch possible due to lower memory
    "gradient_accumulation": 2,
    "epochs": 2,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_length": 228,
    "bf16": True
}
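
As a rough guide only (the authors' exact trainer setup may differ), these values correspond to Hugging Face TrainingArguments along the following lines; output_dir is a placeholder, and max_length=228 is applied at tokenization time rather than here:

from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./dear-8b-lora-ranknet",   # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
)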

Hardware

  • GPUs: 4x NVIDIA A100 (40GB)
  • Training Time: ~12 hours (3x faster than full model)
  • Memory Usage: ~28GB per GPU (vs ~38GB for full)
  • Trainable Parameters: 67M (0.8% of total)

Advantages of LoRA Version

Aspect             LoRA      Full Model
Storage            ~100MB    ~16GB
Training Time      12h       36h
Training Memory    28GB      38GB
Performance        ~98%      100%
Loading Time       Fast      Slow
Easy Updates       ✅ Yes    ❌ No

When to Use LoRA vs Full Model

Use LoRA when:

  • ✅ Storage is limited
  • ✅ Training multiple domain-specific versions
  • ✅ You need fast iteration/experimentation
  • ✅ A ~0.3 NDCG@10 difference is acceptable

Use Full Model when:

  • Maximum performance is required
  • Storage is not a concern
  • You run a single production deployment

Fine-tuning on Your Data

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=1
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 67M || all params: 8B || trainable%: 0.8%

# Train
training_args = TrainingArguments(
    output_dir="./lora-finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,
)

trainer.train()

# Save only the LoRA adapter
model.save_pretrained("./lora-adapter")
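
The adapter itself was trained with a RankNet loss plus knowledge distillation (see Training Method above), whereas the Trainer sketch here uses the default loss. As an illustration only, not the authors' exact implementation, a pairwise RankNet loss over pointwise scores can be written like this:

import torch
import torch.nn.functional as F

def ranknet_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise RankNet loss: push positive documents to outscore negatives.

    pos_scores, neg_scores: shape (batch,), pointwise relevance scores for
    (positive, negative) document pairs sharing the same query.
    """
    # P(pos ranked above neg) is modelled as sigmoid(pos - neg); the target probability is 1.
    diff = pos_scores - neg_scores
    return F.binary_cross_entropy_with_logits(diff, torch.ones_like(diff))

# Toy example with three (positive, negative) pairs
pos = torch.tensor([2.1, 0.5, 1.3])
neg = torch.tensor([0.2, 0.9, -1.0])
print(ranknet_loss(pos, neg).item())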

Model Files

This adapter contains:

  • adapter_config.json - LoRA configuration
  • adapter_model.safetensors or adapter_model.bin - Adapter weights (~100MB)
  • README.md - This documentation
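
To fetch just these adapter files locally, something like the following works with huggingface_hub (the base model itself is downloaded separately by transformers):

from huggingface_hub import snapshot_download

# Download only the adapter repository (~100MB)
local_dir = snapshot_download("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")
print(local_dir)  # directory containing adapter_config.json and the adapter weights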

Related Models

Full Model:

Other LoRA Adapters:

Resources:

Citation

@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}

License

MIT License

