Alibaba-NLP/gte-multilingual-base

Sentence Similarity

sentence-transformers

feature-extraction

text-embeddings-inference


zyznull committed on Jul 27, 2024

Commit 2d7b768 · verified · 1 Parent(s): 0119b51

Update README.md

Files changed (1)

README.md +75 -0


@@ -30,3 +30,78 @@ transformers>=4.39.2
 flash_attn>=2.5.6
 ```
 ## Usage

+Get Dense Embeddings with Transformers
+```
+# Requires transformers>=4.36.0
+import torch.nn.functional as F
+from transformers import AutoModel, AutoTokenizer
+input_texts = [
+    "what is the capital of China?",
+    "how to implement quick sort in python?",
+    "北京",
+    "快排算法介绍"
+]
+model_path = 'Alibaba-NLP/gte-multilingual-base'
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
+# Tokenize the input texts
+batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
+outputs = model(**batch_dict)
+dimension = 768  # Output embedding dimension; should be in [128, 768]
+embeddings = outputs.last_hidden_state[:, 0, :dimension]  # CLS token, truncated along the feature axis
+embeddings = F.normalize(embeddings, p=2, dim=1)
+scores = (embeddings[:1] @ embeddings[1:].T) * 100
+print(scores.tolist())
+```
+Use with sentence-transformers
+```
+from sentence_transformers import SentenceTransformer
+from sentence_transformers.util import cos_sim
+input_texts = [
+    "what is the capital of China?",
+    "how to implement quick sort in python?",
+    "北京",
+    "快排算法介绍"
+]
+model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)
+embeddings = model.encode(input_texts)
+```
+Use with custom code to get dense embeddings and sparse token weights
+```
+# gte_embedding.py is available at https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py
+from gte_embedding import GTEEmbeddidng
+model_path = 'Alibaba-NLP/gte-multilingual-base'
+model = GTEEmbeddidng(model_path)
+query = "中国的首都在哪儿"
+docs = [
+    "what is the capital of China?",
+    "how to implement quick sort in python?",
+    "北京",
+    "快排算法介绍"
+]
+embs = model.encode(docs, return_dense=True, return_sparse=True)
+print('dense_embeddings vecs', embs['dense_embeddings'])
+print('token_weights', embs['token_weights'])
+pairs = [(query, doc) for doc in docs]
+dense_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.0)
+sparse_scores = model.compute_scores(pairs, dense_weight=0.0, sparse_weight=1.0)
+hybrid_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.3)
+print('dense_scores', dense_scores)
+print('sparse_scores', sparse_scores)
+print('hybrid_scores', hybrid_scores)
+```
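
Review note (not part of the commit): the examples above truncate each embedding to `dimension` components before normalizing, and combine dense and sparse scores through `dense_weight` and `sparse_weight`. The dependency-free sketch below illustrates one plausible reading of that arithmetic. It assumes the hybrid score is a weighted sum of a dense dot product and a sparse score computed from the weights of tokens shared by query and document; the actual formula in `gte_embedding.py` is not shown in this diff and may differ. All function names and toy values here are hypothetical.

```python
import math


def truncate_and_normalize(vec, dimension):
    """Keep the first `dimension` components of ONE embedding, then L2-normalize.

    Truncation applies to the feature axis of each vector, not to the batch.
    """
    head = vec[:dimension]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def dense_score(q_vec, d_vec):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(q * d for q, d in zip(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Assumed lexical score: sum of weight products over shared tokens."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def hybrid_score(q_vec, d_vec, q_weights, d_weights,
                 dense_weight=1.0, sparse_weight=0.3):
    """Assumed combination: weighted sum of the dense and sparse scores."""
    return (dense_weight * dense_score(q_vec, d_vec)
            + sparse_weight * sparse_score(q_weights, d_weights))


# Toy data (made up for illustration, not model output).
query_vec = truncate_and_normalize([3.0, 4.0, 12.0], dimension=2)
doc_vec = truncate_and_normalize([4.0, 3.0, 7.0], dimension=2)
query_weights = {"capital": 0.5, "china": 1.0}
doc_weights = {"china": 0.5, "beijing": 1.0}

score = hybrid_score(query_vec, doc_vec, query_weights, doc_weights)
print("dense", dense_score(query_vec, doc_vec))
print("sparse", sparse_score(query_weights, doc_weights))
print("hybrid", score)
```

Setting `sparse_weight=0.0` reduces this to the dense score alone, and `dense_weight=0.0` to the sparse score alone, mirroring the three `compute_scores` calls in the diff.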