Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
Abstract
E²R-FLOPs metrics, including RPP and QPP, provide a hardware-agnostic evaluation of LLM-based rerankers' efficiency and effectiveness.
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and runtime choices (e.g., parallelism and batch size) and often fail to account for model size, which makes them difficult to interpret and obscures the evaluation of the efficiency-effectiveness trade-off. To address this issue, we propose E²R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Alongside the new metrics, we build an interpretable FLOPs estimator that estimates the FLOPs of an LLM-based reranker without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
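To make the two metrics concrete, the following is a minimal sketch rather than the paper's implementation. It assumes RPP is a ranking metric (such as NDCG@10) divided by the average per-query compute in PetaFLOPs, and QPP is the number of queries that fit in one PetaFLOP. The function `estimate_forward_flops` is a hypothetical stand-in that uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass; the abstract does not specify the estimator's formula, so this is not the paper's estimator. All names and numbers are illustrative.

```python
PETA = 1e15  # FLOPs in one PetaFLOP


def estimate_forward_flops(num_params: float, num_tokens: int) -> float:
    """Rough forward-pass cost, ASSUMING ~2 FLOPs per parameter per token.

    This is a generic approximation, not the estimator from the paper.
    """
    return 2.0 * num_params * num_tokens


def rpp(metric: float, flops_per_query: float) -> float:
    """Ranking metric per PetaFLOP (assumed form: metric / PetaFLOPs per query)."""
    return metric / (flops_per_query / PETA)


def qpp(flops_per_query: float) -> float:
    """Queries per PetaFLOP (assumed form: 1 / PetaFLOPs per query)."""
    return PETA / flops_per_query


# Hypothetical pointwise reranker: 7B parameters, 100 candidates per query,
# 500 tokens per (query, passage) prompt, one forward pass per candidate.
flops_per_query = 100 * estimate_forward_flops(7e9, 500)
example_rpp = rpp(0.70, flops_per_query)  # 0.70 is a made-up NDCG@10 value
example_qpp = qpp(flops_per_query)
```

Because both quantities depend only on model size, token counts, and the ranking metric, they stay the same across GPUs, batch sizes, and parallelism settings, which is the hardware-agnostic property the abstract emphasizes.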
Community
This paper proposes E²R-FLOPs, a framework for evaluating the efficiency of LLM-based rerankers using hardware-agnostic metrics: ranking metrics per PetaFLOP (RPP) and queries per PetaFLOP (QPP). Unlike existing proxy metrics (e.g., latency or token count), these new metrics account for model size and compute cost. To support this, we propose an interpretable FLOPs estimator built upon E²R-FLOPs, enabling efficiency analysis without running the model. Comprehensive experiments highlight the efficiency-effectiveness trade-off across diverse LLM rerankers, promoting more interpretable and fair comparisons.
 Tingyu Song