remove details about v1 from other checkpoint (#4)
remove details about v1 from other checkpoint (869be4070611ad5b66a9349cdcfd72040ac5813e)
Co-authored-by: Max Cembalest <[email protected]>
README.md
CHANGED
@@ -2612,110 +2612,10 @@ model-index:
# nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder

`nomic-embed-text-v1-unsupervised` is an 8192 context length text encoder. This is a checkpoint after the contrastive pretraining stage of the multi-stage contrastive training of the [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1).

| Name                       | SeqLen | MTEB      | LoCo      | Jina Long Context | Open Weights | Open Training Code | Open Data |
| :------------------------: | :----- | :-------- | :-------: | :---------------: | :----------: | :----------------: | :-------- |
| nomic-embed-text-v1        | 8192   | **62.39** | **85.53** | 54.16             | ✅           | ✅                 | ✅        |
| jina-embeddings-v2-base-en | 8192   | 60.39     | 85.45     | 51.90             | ✅           | ❌                 | ❌        |
| text-embedding-3-small     | 8191   | 62.26     | 82.40     | **58.20**         | ❌           | ❌                 | ❌        |
| text-embedding-ada-002     | 8191   | 60.99     | 52.7      | 55.25             | ❌           | ❌                 | ❌        |

If you would like to finetune a model on more data, you can use this model as an initialization.
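For illustration, a minimal sketch of continuing contrastive finetuning from this checkpoint with `sentence-transformers` might look like the following. The training pairs, prefixes, and hyperparameters are placeholders, not the recipe from our training pipeline (see the `contrastors` repository linked below for that).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Initialize from the unsupervised checkpoint
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-unsupervised", trust_remote_code=True)

# Toy (query, document) pairs; replace with your own dataset
train_examples = [
    InputExample(texts=["search_query: what is contrastive learning?",
                        "search_document: Contrastive learning pulls paired texts together in embedding space."]),
    InputExample(texts=["search_query: how long can inputs be?",
                        "search_document: The encoder supports sequences of up to 8192 tokens."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch-negatives contrastive loss
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```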
## Hosted Inference API

The easiest way to get started with Nomic Embed is through the Nomic Embedding API.

Generating embeddings with the `nomic` Python client is as easy as:

```python
from nomic import embed

output = embed.text(
    texts=['Nomic Embedding API', '#keepAIOpen'],
    model='nomic-embed-text-v1',
    task_type='search_document'
)

print(output)
```
For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).

## Data Visualization

Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!

[Nomic Atlas map of a 5M contrastive pretraining sample](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)

## Training Details

We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048), the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.

In the second finetuning stage, higher-quality labeled datasets, such as search queries and answers from web searches, are leveraged. Data curation and hard-example mining are crucial in this stage.
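As a rough illustration of the objective family behind both stages (not the exact loss, batching, or hyperparameters from our training code; the temperature and shapes below are assumptions), an InfoNCE-style contrastive loss over a batch of paired embeddings can be sketched as:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim) L2-normalized embeddings of paired texts."""
    logits = (query_emb @ doc_emb.T) / temperature                # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)                        # other in-batch documents act as negatives

# Toy usage with random, normalized vectors
queries = F.normalize(torch.randn(4, 768), dim=-1)
documents = F.normalize(torch.randn(4, 768), dim=-1)
print(info_nce(queries, documents))
```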
For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).

The data used to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
## Usage

Note that `nomic-embed-text` requires prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`. For retrieval applications, you should prepend `search_document` to all of your documents and `search_query` to your queries.
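For example, a minimal retrieval-style sketch (with made-up example texts) that applies both prefixes could look like this; the sections below cover the loading options in more detail:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-unsupervised", trust_remote_code=True)

# Prefix documents with `search_document` and queries with `search_query`
documents = [
    "search_document: t-SNE is a technique for visualizing high-dimensional data.",
    "search_document: Nomic Atlas lets you explore large text datasets as interactive maps.",
]
queries = ["search_query: What is TSNE?"]

doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)

print(util.cos_sim(query_embeddings, doc_embeddings))  # higher score = closer match
```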
### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-unsupervised", trust_remote_code=True)
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
```
### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)
```
The model natively supports scaling of the sequence length past 2048 tokens. To do so, apply the following changes:

```diff
- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
+ tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)

- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
+ model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True, rotary_scaling_factor=2)
```
# Join the Nomic Community
# nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder

`nomic-embed-text-v1-unsupervised` is an 8192 context length text encoder. This is a checkpoint after the contrastive pretraining stage of the multi-stage contrastive training of the [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). The purpose of releasing this checkpoint is to open-source training artifacts from our Nomic Embed Text technical report, available [here](https://arxiv.org/pdf/2402.01613).

If you want to use a model to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).
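A minimal sketch of that suggested path (the example text is illustrative):

```python
from sentence_transformers import SentenceTransformer

# Embed with the final nomic-embed-text-v1 model rather than this intermediate checkpoint
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode(["search_document: Nomic Embed is an 8192 context length text encoder."])
print(embeddings.shape)
```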
# Join the Nomic Community