# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries issued to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.

The training data consists of over 500k examples, while the complete corpus contains over 8.8 million passages.
## Version History

As we work on the topic, we will publish updated (and improved) models.

### v1

Version 1 models were trained on the training set of the MS MARCO passage retrieval task. The models were trained with in-batch negative sampling via MultipleNegativesRankingLoss, using a scaling factor of 20 and a batch size of 128.
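The loss above can be sketched in plain Python (toy lists stand in for real embedding tensors, and `mnr_loss` is a hypothetical name for illustration, not the library API): for each query, every other passage in the batch serves as a negative, and a softmax cross-entropy over the scaled cosine scores pushes the matching passage to the top.

```python
import math

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss: query i's positive is
    passage i; all other passages in the batch act as negatives."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    total = 0.0
    for i, q in enumerate(query_embs):
        # scaled cosine score of this query against every passage in the batch
        scores = [scale * cos(q, p) for p in passage_embs]
        # cross-entropy with the matching passage (index i) as the label
        total += -(scores[i] - math.log(sum(math.exp(s) for s in scores)))
    return total / len(query_embs)
```

In the library this is `sentence_transformers.losses.MultipleNegativesRankingLoss`; with a batch size of 128, each query is contrasted against 127 in-batch negatives.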
They can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')

query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```
**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10: 23.28
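Given embeddings from one of the models above, semantic search reduces to ranking passages by cosine similarity to the query embedding. A minimal sketch, with toy 3-dimensional vectors standing in for real `model.encode` outputs (the `search` helper is illustrative, not part of the library):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_emb, passages):
    """Rank (text, embedding) pairs by cosine similarity to the query, best first."""
    return sorted(passages, key=lambda p: cosine(query_emb, p[1]), reverse=True)

# toy vectors in place of model.encode('[QRY] ...') / model.encode('[DOC] ...') outputs
query = [0.9, 0.1, 0.0]
corpus = [
    ("London population passage", [0.8, 0.2, 0.1]),
    ("Unrelated cooking passage", [0.0, 0.1, 0.9]),
]
print(search(query, corpus)[0][0])  # the population passage ranks first
```

With the real model, remember to prepend the `'[QRY] '` and `'[DOC] '` markers shown above before encoding queries and passages.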