---
language:
- en
license: mit
library_name: peft
tags:
- reranking
- information-retrieval
- pointwise
- lora
- peft
- ranknet
base_model: meta-llama/Llama-3.1-8B
datasets:
- Tevatron/msmarco-passage
- abdoelsayed/DeAR-COT
pipeline_tag: text-classification
---

# DeAR-8B-Reranker-RankNet-LoRA-v1

## Model Description

**DeAR-8B-Reranker-RankNet-LoRA-v1** is a LoRA (Low-Rank Adaptation) adapter for neural reranking. This lightweight adapter can be applied to LLaMA-3.1-8B to create a pointwise reranker with minimal storage overhead: it achieves performance comparable to the fully fine-tuned model while requiring only ~100MB of storage.

## Model Details

- **Model Type:** LoRA Adapter for Pointwise Reranking
- **Base Model:** meta-llama/Llama-3.1-8B
- **Adapter Size:** ~100MB (vs. 16GB for the full model)
- **Training Method:** LoRA with RankNet Loss + Knowledge Distillation
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj

## Key Features

- ✅ **Lightweight:** Only ~100MB vs. 16GB for the full model
- ✅ **Efficient Training:** Trains ~3x faster than full fine-tuning
- ✅ **Easy Deployment:** Just load the adapter on top of the base model
- ✅ **Comparable Performance:** ~98% of full-model performance
- ✅ **Memory Efficient:** Lower GPU memory usage during training

## Usage

### Option 1: Load with PEFT (Recommended)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

# Load LoRA adapter
adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"

# Get base model from adapter config
config = PeftConfig.from_pretrained(adapter_path)
base_model_name = config.base_model_name_or_path

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name,
    num_labels=1,
    torch_dtype=torch.bfloat16
)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # Merge adapter into base model
model.eval().cuda()

# Use the model
query = "What is machine learning?"
document = "Machine learning is a subset of artificial intelligence..."

inputs = tokenizer(
    f"query: {query}",
    f"document: {document}",
    return_tensors="pt",
    truncation=True,
    max_length=228,
    padding="max_length"
)
inputs = {k: v.cuda() for k, v in inputs.items()}

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Relevance score: {score}")
```
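If the merged model will be reused across sessions, the merged weights and tokenizer can be saved once and reloaded later as a plain `transformers` checkpoint, skipping the PEFT merge step. A minimal sketch, assuming the `model` and `tokenizer` objects from Option 1; the output directory name is just an example, and note that this writes the full merged checkpoint (~16GB), trading the adapter's storage advantage for simpler deployment:

```python
# Persist the merged reranker built in Option 1.
output_dir = "./dear-8b-reranker-ranknet-merged"  # example path
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later: reload directly, no PEFT required.
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# tokenizer = AutoTokenizer.from_pretrained(output_dir)
# model = AutoModelForSequenceClassification.from_pretrained(output_dir, num_labels=1)
```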
### Option 2: Use Helper Function

```python
import torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig


def load_lora_ranker(adapter_path: str, device: str = "cuda"):
    """Load LoRA adapter and merge with base model."""
    # Get base model path from adapter config
    peft_config = PeftConfig.from_pretrained(adapter_path)
    base_model_name = peft_config.base_model_name_or_path

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "right"

    # Load base model
    base_model = AutoModelForSequenceClassification.from_pretrained(
        base_model_name,
        num_labels=1,
        torch_dtype=torch.bfloat16
    )

    # Load LoRA adapter and merge
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model = model.merge_and_unload()
    model.eval().to(device)

    return tokenizer, model


# Load model
tokenizer, model = load_lora_ranker("abdoelsayed/dear-8b-reranker-ranknet-lora-v1")


# Rerank documents
@torch.inference_mode()
def rerank(tokenizer, model, query: str, docs: List[Tuple[str, str]], batch_size: int = 64):
    """Rerank documents for a query."""
    device = next(model.parameters()).device
    scores = []
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        queries = [f"query: {query}"] * len(batch)
        documents = [f"document: {title} {text}" for title, text in batch]
        inputs = tokenizer(
            queries,
            documents,
            return_tensors="pt",
            truncation=True,
            max_length=228,
            padding=True
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        logits = model(**inputs).logits.squeeze(-1)
        scores.extend(logits.cpu().tolist())
    return sorted(enumerate(scores), key=lambda x: x[1], reverse=True)


# Example
query = "When did Thomas Edison invent the light bulb?"
docs = [
    ("", "Thomas Edison invented the light bulb in 1879"),
    ("", "Coffee is good for diet"),
    ("", "Lightning strike at Seoul"),
]
ranking = rerank(tokenizer, model, query, docs)
print(ranking)  # e.g. [(0, 5.2), (2, -3.1), (1, -4.8)]
```

### Using Without Merging (Memory Efficient)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification

adapter_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"
config = PeftConfig.from_pretrained(adapter_path)

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load adapter (without merging)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

# Use the model (adapter layers are applied automatically)
# ... same inference code as above ...
```
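Keeping the adapter unmerged also lets several DeAR adapters share a single copy of the base model and be switched at runtime. The sketch below uses PEFT's multi-adapter API with the CE variant listed under Related Models as a second adapter; the adapter names are arbitrary, and exact handling of the per-adapter classification heads may depend on your PEFT version:

```python
import torch
from transformers import AutoModelForSequenceClassification
from peft import PeftModel, PeftConfig

ranknet_path = "abdoelsayed/dear-8b-reranker-ranknet-lora-v1"
ce_path = "abdoelsayed/dear-8b-reranker-ce-lora-v1"  # CE adapter from Related Models

config = PeftConfig.from_pretrained(ranknet_path)
base_model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the RankNet adapter, then register the CE adapter alongside it
model = PeftModel.from_pretrained(base_model, ranknet_path, adapter_name="ranknet")
model.load_adapter(ce_path, adapter_name="ce")
model.eval()

model.set_adapter("ranknet")  # score with the RankNet-distilled adapter
# ... run inference ...
model.set_adapter("ce")       # switch to the cross-entropy adapter
```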
## Performance

| Benchmark  | LoRA | Full Model | Difference |
|------------|------|------------|------------|
| TREC DL19  | 74.2 | 74.5       | -0.3       |
| TREC DL20  | 72.5 | 72.8       | -0.3       |
| BEIR (Avg) | 44.9 | 45.2       | -0.3       |
| MS MARCO   | 68.6 | 68.9       | -0.3       |

✅ **98% of full model performance with only 0.6% of the storage!**

## Training Details
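The adapter is trained with a RankNet loss combined with knowledge distillation from a teacher reranker (see the paper for the exact objective). As a rough illustration only, a pairwise RankNet-style loss driven by teacher scores could look like the sketch below; the tensors and the hard teacher ordering are illustrative assumptions, not the exact DeAR formulation:

```python
import torch
import torch.nn.functional as F


def ranknet_distill_loss(student_scores: torch.Tensor,
                         teacher_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise RankNet-style loss over one query's candidate documents.

    Both inputs have shape (num_docs,): `student_scores` are the adapter's
    relevance logits, `teacher_scores` come from the teacher reranker.
    Each ordered pair (i, j) with i != j contributes a binary cross-entropy
    term pushing the student to reproduce the teacher's preference.
    """
    # Pairwise differences: entry (i, j) = score_i - score_j
    s_diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)
    t_diff = teacher_scores.unsqueeze(1) - teacher_scores.unsqueeze(0)

    # Teacher preference: 1 if the teacher ranks doc i above doc j, else 0
    target = (t_diff > 0).float()

    # Ignore the diagonal (a document compared with itself)
    mask = ~torch.eye(student_scores.size(0), dtype=torch.bool,
                      device=student_scores.device)

    return F.binary_cross_entropy_with_logits(s_diff[mask], target[mask])


# Toy example: four candidate documents for one query
student = torch.tensor([2.1, -0.3, 0.7, -1.5])
teacher = torch.tensor([5.0, -2.0, 1.0, -3.0])
print(ranknet_distill_loss(student, teacher))
```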
### LoRA Configuration

```python
lora_config = {
    "r": 16,               # LoRA rank
    "lora_alpha": 32,      # Scaling factor
    "target_modules": [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "SEQ_CLS"
}
```

### Training Hyperparameters

```python
training_args = {
    "learning_rate": 1e-4,       # Higher than full fine-tuning
    "batch_size": 4,             # Larger batch possible due to lower memory
    "gradient_accumulation": 2,
    "epochs": 2,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_length": 228,
    "bf16": True
}
```

### Hardware

- **GPUs:** 4x NVIDIA A100 (40GB)
- **Training Time:** ~12 hours (3x faster than the full model)
- **Memory Usage:** ~28GB per GPU (vs. ~38GB for full fine-tuning)
- **Trainable Parameters:** 67M (0.8% of total)

## Advantages of LoRA Version

| Aspect          | LoRA   | Full Model |
|-----------------|--------|------------|
| Storage         | 100MB  | 16GB       |
| Training Time   | 12h    | 36h        |
| Training Memory | 28GB   | 38GB       |
| Performance     | 98%    | 100%       |
| Loading Time    | Fast   | Slow       |
| Easy Updates    | ✅ Yes | ❌ No      |

## When to Use LoRA vs Full Model

**Use LoRA when:**
- ✅ Storage is limited
- ✅ Training multiple domain-specific versions
- ✅ You need fast iteration and experimentation
- ✅ A 0.3 NDCG@10 difference is acceptable

**Use the Full Model when:**
- Maximum performance is required
- Storage is not a concern
- You have a single production deployment

## Fine-tuning on Your Data

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    num_labels=1
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)

# Apply LoRA
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 67M || all params: 8B || trainable%: 0.8%

# Train
training_args = TrainingArguments(
    output_dir="./lora-finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # your ranking dataset
)
trainer.train()

# Save only the LoRA adapter
model.save_pretrained("./lora-adapter")
```

## Model Files

This adapter contains:
- `adapter_config.json` - LoRA configuration
- `adapter_model.safetensors` or `adapter_model.bin` - Adapter weights (~100MB)
- `README.md` - This documentation

## Related Models

**Full Model:**
- [DeAR-8B-RankNet](https://huggingface.co/abdoelsayed/dear-8b-reranker-ranknet-v1) - Full fine-tuned version

**Other LoRA Adapters:**
- [DeAR-8B-CE-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-ce-lora-v1) - Binary Cross-Entropy
- [DeAR-8B-Listwise-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-listwise-lora-v1) - Listwise ranking

**Resources:**
- [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- [Teacher Model](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)

## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```

## License

MIT License

## More Information

- **GitHub:** [DataScienceUIBK/DeAR-Reranking](https://github.com/DataScienceUIBK/DeAR-Reranking)
- **Paper:** [arXiv:2508.16998](https://arxiv.org/abs/2508.16998)
- **Collection:** [DeAR Models](https://huggingface.co/collections/abdoelsayed/dear-reranking)