Do We Need Domain-Specific Embedding Models? An Empirical Investigation
Abstract
State-of-the-art embedding models trained on general-purpose corpora underperform on domain-specific datasets, indicating the need for domain-specific models in the large language model era.
Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets such as the Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, using the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of finance-specific text datasets. We evaluate seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop relative to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB's higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns, even when trained on large general-purpose corpora. This study sheds light on the necessity of developing domain-specific embedding models in the LLM era, offering valuable insights for researchers and practitioners.
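The abstract does not spell out the four complexity measures or the statistical procedure used to control for them. As a rough illustration of what "controlling for complexity" can mean, the sketch below fits an ordinary least squares model of the form score = b0 + b1 * is_domain + b2 * complexity. If b1 stays clearly negative once complexity is in the model, the drop is not explained by complexity alone. The function names, the single complexity covariate, and the toy numbers are all assumptions made for this example. They are not the paper's method or data.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so every row operation is applied to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def domain_effect(
    scores: Sequence[float],
    is_domain: Sequence[int],
    complexity: Sequence[float],
) -> Tuple[float, float, float]:
    """Fit score = b0 + b1*is_domain + b2*complexity by OLS.

    Returns (b0, b1, b2). b1 is the estimated score gap between the
    domain benchmark and the general benchmark at equal complexity.
    """
    rows = [[1.0, float(d), float(c)] for d, c in zip(is_domain, complexity)]
    k = 3
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    return b0, b1, b2


# Synthetic toy data (NOT results from the paper): 0 = general, 1 = domain.
toy_is_domain = [0, 0, 0, 1, 1, 1]
toy_complexity = [0.1, 0.5, 0.9, 0.2, 0.6, 1.0]
# Scores are generated from known coefficients so the fit can be checked.
toy_scores = [0.8 - 0.1 * d - 0.2 * c for d, c in zip(toy_is_domain, toy_complexity)]

b0, b1, b2 = domain_effect(toy_scores, toy_is_domain, toy_complexity)
```

On real data each row would be one (model, dataset) score, and a fuller analysis would add the remaining complexity measures as extra covariates and report uncertainty for b1.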
Community
Thanks to the community! A newer version with more comprehensive experiments is available here: 🤗
Newer Version: FinMTEB: Finance Massive Text Embedding Benchmark
Link: https://arxiv.org/abs/2502.10990
Github: https://github.com/yixuantt/FinMTEB
Leaderboard: https://huggingface.co/spaces/FinanceMTEB/FinMTEB
Model: yixuantt/Fin-e5