# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large-scale information retrieval corpus created from real user search queries issued to the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages that are relevant to the search query.

The training data consists of over 500k examples, while the complete corpus contains over 8.8 million passages.
## Version History

As we work on the topic, we will publish updated (and improved) models.

### v1

Version 1 models were trained on the training set of the MS MARCO passage retrieval task. The models were trained with in-batch negative sampling via MultipleNegativesRankingLoss, using a scaling factor of 20 and a batch size of 128.
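The loss above can be sketched in plain Python (toy lists stand in for real embedding tensors, and `mnr_loss` is a hypothetical name for illustration, not the library API): for each query, every other passage in the batch serves as a negative, and a softmax cross-entropy over the scaled cosine scores pushes the matching passage to the top.

```python
import math

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """Sketch of MultipleNegativesRankingLoss: query i's positive is
    passage i; all other passages in the batch act as negatives."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    total = 0.0
    for i, q in enumerate(query_embs):
        # scaled cosine score of this query against every passage in the batch
        scores = [scale * cos(q, p) for p in passage_embs]
        # cross-entropy with the matching passage (index i) as the label
        total += -(scores[i] - math.log(sum(math.exp(s) for s in scores)))
    return total / len(query_embs)
```

In the library this is `sentence_transformers.losses.MultipleNegativesRankingLoss`; with a batch size of 128, each query is contrasted against 127 in-batch negatives.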
They can be used like this:
```python
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')

query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```
**Models**:
- **distilroberta-base-msmarco-v1** - Performance on the MS MARCO dev dataset (queries.dev.small.tsv): MRR@10: 23.28
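Given embeddings from one of the models above, semantic search reduces to ranking passages by cosine similarity to the query embedding. A minimal sketch, with toy 3-dimensional vectors standing in for real `model.encode` outputs (the `search` helper is illustrative, not part of the library):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_emb, passages):
    """Rank (text, embedding) pairs by cosine similarity to the query, best first."""
    return sorted(passages, key=lambda p: cosine(query_emb, p[1]), reverse=True)

# toy vectors in place of model.encode('[QRY] ...') / model.encode('[DOC] ...') outputs
query = [0.9, 0.1, 0.0]
corpus = [
    ("London population passage", [0.8, 0.2, 0.1]),
    ("Unrelated cooking passage", [0.0, 0.1, 0.9]),
]
print(search(query, corpus)[0][0])  # the population passage ranks first
```

With the real model, remember to prepend the `'[QRY] '` and `'[DOC] '` markers shown above before encoding queries and passages.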