Paper: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning (arXiv:2402.16829)
This is a sentence-transformers model finetuned from microsoft/mpnet-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
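The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence vector is the average of the token vectors. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the function name `mean_pool` and the toy 2-dimensional inputs are illustrative assumptions (the real model produces 768-dimensional vectors).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


# Toy input: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored because its mask entry is 0, which is why the large values in the third row do not affect the result.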
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Areeb-02/mpnet-base-GISTEmbedLoss-MSEE_Evaluator-salestax-docs")
# Run inference
sentences = [
'Based on the context information provided, what are the different gross receipts tax rates for businesses in San Francisco for tax years 2022, 2023, and 2024?',
'$9.75 per $1,000) for taxable gross receipts over $25,000,000\n44SANCO\n2024 NAY LO\n(D) For tax year 2024 if the Controller certifies under Section 953.10 that the\nDEPARTMENT OF\n95% gross receipts threshold has been met for tax year 2024, and for tax years beginning on or after\nJanuary 1, 2025:\n0.814% (e.g. $8.14 per $1,000) for taxable gross receipts between $0 and $1,000,000\n0.853% (e.g. $8.53 per $1,000) for taxable gross receipts between $1,000,000.01 and\n$2,500,000\n0.93% (e.g. $9.30 per $1,000) for taxable gross receipts between $2,500,000.01 and\n$25,000,000\n1.008% (e.g. $10.08 per $1,000) for taxable gross receipts over $25,000,000\n(3) For all business activities not otherwise exempt and not elsewhere\nsubjected to a gross receipts tax rate or an administrative office tax by this Article 12-A-1:\n(B) For tax years 2022 and, if the Controller does not certify under\nSection 953.10 that the 90% gross receipts threshold has been met for tax year 2023, for tax\nyear 2023:\n0.788% (e.g. $7.88 per $1,000) for taxable gross receipts between $0 and $1,000,000\n0.825% (e.g. $8.25 per $1,000) for taxable gross receipts between $1,000,000.01 and\n$2,500,000\n0.9% (e.g. $9 per $1,000) for taxable gross receipts between $2,500,000.01 and\n$25,000,000\n0.975% (e.g. $9.75 per $1,000) for taxable gross receipts over $25,000,000\n(C) For tax year 2023 if the Controller certifies under Section 953.10 that the\n90% gross receipts threshold has been met for tax year 2023,',
'(d) In no event shall the credit under this Section 960.4 reduce a person or combined group\'s\nGross Receipts Tax liability to less than $0 for any tax year. The credit under this Section shall not be\nrefundable and may not be carried forward to a subsequent year.\nSEC. 966. CONTROLLER REPORTS.\nThe Controller shall prepare reports by September 1, 2026, and September 1, 2027,\nrespectively, that discuss current economic conditions in the City and the performance of the tax system\nrevised by the voters in the ordinance adding this Section 966.\nSection 6. Article 21 of the Business and Tax Regulations Code is hereby amended by\nrevising Section 2106 to read as follows:\nSEC. 2106. SMALL BUSINESS EXEMPTION.\n(a) For tax years ending on or before December 31, 2024, nNotwithstanding any other\nprovision of this Article 21, a person or combined group exempt from payment of the gross\nreceipts tax under Section 954.1 of Article 12-A-1, as amended from time to time, shall also\nbe exempt from payment of the Early Care and Education Commercial Rents Tax.\n79SAN\nDL W(b) For tax years beginning on or after January 1, 2025, notwithstanding any other provision\nof this Article 21, a "small business enterprise" shall be exempt from payment of the Early Care and\nEducation Commercial Rents Tax. For purposes of this subsection (b), the term "small business\nenterprise" shall mean any person or combined group whose gross receipts within the City, determined\nunder Article 12-A-1, did not exceed $2,325,000, adjusted annually in accordance with the increase in\nthe Consumer Price Index: All Urban Consumers for the San Francisco/Oakland/Hayward Area for All\nItems as reported by the United States Bureau of Labor Statistics, or any successor to that index, as of\nDecember 31 of the calendar year two years prior to the tax year, beginning with tax year 2026, and\nrounded to the nearest $10,000. This subsection (b) shall not apply to a person or combined group\nsubject to a tax on administrative office business activities in Section 953.8 of Article 12-A-1.\nSection 7.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
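For readers who want to see what the pairwise scores represent: `model.similarity` uses cosine similarity unless the model is configured otherwise. The sketch below computes the same kind of matrix by hand in pure Python. The helper name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, standing in for the 768-dimensional embeddings returned by `model.encode`.

```python
import math
from typing import List


def cosine_similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Return the matrix of cosine similarities between every pair of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
         for v, norm_v in zip(vectors, norms)]
        for u, norm_u in zip(vectors, norms)
    ]


# Toy stand-ins for three sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = cosine_similarity_matrix(toy_embeddings)
for row in sims:
    print([round(score, 4) for score in row])
```

Each diagonal entry is 1 (a vector compared with itself), and the matrix is symmetric, which matches the square shape printed by the snippet above.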
Dataset: stsb-dev, evaluated with MSEEvaluator
| Metric | Value |
|---|---|
| negative_mse | -2.4282 |
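The `negative_mse` metric is the mean squared error between two sets of embeddings with the sign flipped, so that higher is better. The sketch below shows that arithmetic in pure Python. The `scale=100.0` factor reflects my understanding of how the Sentence Transformers MSEEvaluator reports the value and should be treated as an assumption, as should the function name and the toy vectors.

```python
from typing import List


def negative_mse(source: List[List[float]], target: List[List[float]], scale: float = 100.0) -> float:
    """Negated, scaled mean squared error over all embedding components."""
    total = 0.0
    count = 0
    for source_vec, target_vec in zip(source, target):
        for s, t in zip(source_vec, target_vec):
            total += (s - t) ** 2
            count += 1
    return -scale * total / count


# Toy embeddings: squared errors are 0, 0.01, 0.04, 0.09, so the mean is 0.035.
source_embeddings = [[0.0, 0.1], [0.2, 0.3]]
target_embeddings = [[0.0, 0.0], [0.0, 0.0]]
score = negative_mse(source_embeddings, target_embeddings)
print(round(score, 4))  # -3.5
```

A value closer to zero means the two sets of embeddings agree more closely.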
Columns: sentence1 and sentence2
| | sentence1 | sentence2 |
|---|---|---|
| type | string | string |
| details | | |
| sentence1 | sentence2 |
|---|---|
| What types of businesses are subject to the gross receipts tax in San Francisco, and how is their San Francisco gross receipts calculated? What are the current rates for this tax, and are there any exemptions or scheduled increases? | The Way It Is Now |
| What is the homelessness gross receipts tax, and which businesses are required to pay it? What are the current rates for this tax, and how do they vary based on the amount of San Francisco gross receipts? Are there any exemptions or scheduled increases for this tax? | The Way It Is Now |
| What is the proposed measure that voters may approve to change the City's business taxes in San Francisco? | The |
Loss: GISTEmbedLoss with these parameters:
{'guide': SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
), 'temperature': 0.01}
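To make the role of the guide model and the temperature concrete, here is a simplified pure-Python sketch of guided in-sample negative selection. It is not the library's GISTEmbedLoss: it scores only anchor-to-positive pairs, whereas the published method also considers other pair types, and the function name `gist_style_loss` and all toy vectors are illustrative assumptions. The idea shown is that an in-batch candidate is dropped as a negative whenever the guide rates it as more similar to the anchor than the true positive.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def gist_style_loss(anchors, positives, guide_anchors, guide_positives, temperature=0.01):
    """Contrastive loss where the guide's similarities decide which negatives to keep."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    ga = [normalize(v) for v in guide_anchors]
    gp = [normalize(v) for v in guide_positives]
    losses = []
    for i in range(len(a)):
        threshold = dot(ga[i], gp[i])  # guide similarity of the true pair
        positive_logit = dot(a[i], p[i]) / temperature
        logits = [positive_logit]
        for j in range(len(p)):
            if j == i:
                continue
            if dot(ga[i], gp[j]) > threshold:
                continue  # guide suspects a false negative, so mask it out
            logits.append(dot(a[i], p[j]) / temperature)
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - positive_logit)
    return sum(losses) / len(losses)


identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
# Guide agrees with the labels: no candidate is masked.
loss_plain = gist_style_loss(identity, identity, identity, identity, temperature=1.0)
# Guide rates every in-batch negative above the true pair: all are masked.
loss_masked = gist_style_loss(identity, identity, identity, swapped, temperature=1.0)
print(loss_plain, loss_masked)
```

With a temperature as low as the 0.01 listed above, the similarities are divided by a small number before the softmax, so small differences between candidates are strongly amplified.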
Non-default hyperparameters:
eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
warmup_ratio: 0.1
All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional
Training logs:
| Epoch | Step | stsb-dev_negative_mse |
|---|---|---|
| 0 | 0 | -2.4282 |
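The hyperparameters listed above combine `learning_rate: 5e-05`, `lr_scheduler_type: linear`, and `warmup_ratio: 0.1`. The sketch below shows how such a schedule typically behaves: the rate rises linearly from zero during the first 10% of steps, then decays linearly back to zero. The total of 100 steps is a hypothetical value chosen for illustration, since the card does not state the number of training steps, and the function name is likewise an assumption.

```python
import math


def linear_schedule_with_warmup(step: int, total_steps: int,
                                base_lr: float = 5e-05, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 100  # hypothetical; not taken from the model card
for step in (0, 5, 10, 55, 100):
    print(step, linear_schedule_with_warmup(step, TOTAL_STEPS))
```

Under these assumed 100 steps, the peak rate is reached at step 10 and the rate returns to zero at the final step.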
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
year={2024},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model: microsoft/mpnet-base