fatihburakkaragoz committed
Commit de8d860 · verified · 1 Parent(s): 3ddebe3

Initial model upload (model + tokenizer + config)
README.md ADDED
@@ -0,0 +1,200 @@
+ ---
+ license: mit
+ base_model: microsoft/deberta-v3-base
+ tags:
+ - text-classification
+ - question-answering
+ - semantic-similarity
+ - quora
+ - duplicate-detection
+ - transformers
+ - pytorch
+ datasets:
+ - quora
+ language:
+ - en
+ metrics:
+ - roc_auc
+ model-index:
+ - name: deberta-v3-quora-question-pairs
+   results:
+   - task:
+       type: text-classification
+       name: Question Pair Duplicate Detection
+     dataset:
+       name: Quora Question Pairs
+       type: quora
+     metrics:
+     - type: roc_auc
+       value: 0.9759
+       name: ROC AUC
+ ---
+
+ # DeBERTa-v3 for Quora Question Pairs Duplicate Detection
+
+ A fine-tuned DeBERTa-v3-base model for identifying duplicate question pairs, achieving 97.59% ROC AUC on the Quora Question Pairs dataset.
+
+ ## Model Description
+
+ This model is a fine-tuned version of [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) on the Quora Question Pairs dataset. It uses a cross-encoder architecture to determine whether two questions are semantically equivalent.
+
+ **Key Features:**
+ - Cross-encoder architecture for higher accuracy than bi-encoder approaches
+ - Probability calibration for reliable confidence estimates
+ - Robust handling of missing/empty questions
+ - Production-ready inference pipeline
+
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | ROC AUC | 97.59% |
+ | Training Loss | 0.116 |
+ | Validation Loss | 0.214 |
+
+ ## Intended Use
+
+ **Primary Use Cases:**
+ - Question deduplication systems
+ - Semantic similarity detection
+ - Content moderation for duplicate questions
+ - Search and retrieval systems
+
+ **Out-of-Scope Use:**
+ - General text similarity (the model is optimized for questions)
+ - Languages other than English
+ - Long texts (the model was trained with a maximum sequence length of 128 tokens)
+
+ ## Usage
+
+ ### Basic Inference
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "your-username/deberta-v3-quora-question-pairs"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example usage
+ question1 = "How do I learn Python programming?"
+ question2 = "What's the best way to learn Python?"
+
+ # Tokenize and predict
+ inputs = tokenizer(question1, question2,
+                    truncation=True, padding=True,
+                    max_length=128, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     probability = torch.softmax(logits, dim=-1)[0, 1].item()
+
+ print(f"Duplicate probability: {probability:.3f}")
+ ```
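+
+ For scoring many pairs at once, the tokenizer and model loaded above can be called on parallel lists of first and second questions. The following is a minimal batched sketch (the example pairs are illustrative):
+
+ ```python
+ # Batched scoring: pass parallel lists of first and second questions
+ firsts = ["How do I learn Python programming?", "How do I learn Python programming?"]
+ seconds = ["What's the best way to learn Python?", "How do I bake bread?"]
+
+ batch = tokenizer(firsts, seconds, truncation=True, padding=True,
+                   max_length=128, return_tensors="pt")
+
+ with torch.no_grad():
+     probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
+
+ for q1, q2, p in zip(firsts, seconds, probs.tolist()):
+     print(f"{p:.3f}  {q1!r} vs {q2!r}")
+ ```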
+
+ ### With Probability Calibration (Recommended)
+
+ For the most accurate confidence estimates, use the included calibrator:
+
+ ```python
+ import joblib
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model, tokenizer, and calibrator
+ model_name = "your-username/deberta-v3-quora-question-pairs"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Note: deberta_cal.pkl is stored in the model repository but is not loaded by
+ # from_pretrained, so download it from the repository files before running this.
+ calibrator = joblib.load("deberta_cal.pkl")
+
+ def predict_duplicate(question1, question2):
+     # Get the raw model score for the "duplicate" class
+     inputs = tokenizer(question1, question2, truncation=True,
+                        padding=True, max_length=128, return_tensors="pt")
+
+     with torch.no_grad():
+         logits = model(**inputs).logits
+         raw_prob = torch.sigmoid(logits[0, 1]).item()
+
+     # Apply calibration for better confidence estimates
+     calibrated_prob = calibrator.predict_proba([[raw_prob]])[0, 1]
+     return calibrated_prob
+
+ # Example
+ prob = predict_duplicate("How to cook pasta?", "What's the best pasta recipe?")
+ print(f"Calibrated duplicate probability: {prob:.3f}")
+ ```
+
+ ## Training Details
+
+ ### Training Data
+ - **Dataset:** Quora Question Pairs (~400K question pairs)
+ - **Split:** 90% training, 10% validation (stratified)
+ - **Preprocessing:** Missing values filled with empty strings
+
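+ The preprocessing and split described above can be reproduced roughly as follows. This is a sketch, not the exact training script; reading the Kaggle `train.csv` with pandas and the fixed `random_state` are assumptions.
+
+ ```python
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+
+ # Assumed input: Kaggle QQP "train.csv" with question1, question2, is_duplicate columns
+ df = pd.read_csv("train.csv")
+
+ # Missing questions become empty strings rather than NaN
+ df["question1"] = df["question1"].fillna("")
+ df["question2"] = df["question2"].fillna("")
+
+ # 90/10 split, stratified on the duplicate label
+ train_df, val_df = train_test_split(
+     df, test_size=0.1, stratify=df["is_duplicate"], random_state=42
+ )
+ ```
+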
+ ### Training Configuration
+ - **Base Model:** microsoft/deberta-v3-base
+ - **Architecture:** Cross-encoder with sequence classification head
+ - **Max Length:** 128 tokens
+ - **Batch Size:** 8 per device (with gradient accumulation)
+ - **Learning Rate:** 2e-5
+ - **Epochs:** 3
+ - **Optimizer:** AdamW
+ - **Precision:** FP16
+
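+ As a rough illustration, the configuration above maps onto `transformers.TrainingArguments` along these lines. Values not listed above (output directory, gradient accumulation steps, evaluation cadence) are assumptions:
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="deberta-v3-quora-question-pairs",  # assumed name
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,  # "with gradient accumulation" -- exact value not stated
+     learning_rate=2e-5,
+     fp16=True,
+     eval_strategy="epoch",          # assumed; matches the per-epoch results below
+     save_strategy="epoch",          # assumed
+ )
+ ```
+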
+ ### Training Results
+
+ | Epoch | Training Loss | Validation Loss | ROC AUC |
+ |-------|---------------|-----------------|---------|
+ | 1 | 0.219 | 0.211 | 0.972 |
+ | 2 | 0.171 | 0.198 | 0.976 |
+ | 3 | 0.116 | 0.214 | 0.976 |
+
+ ## Technical Details
+
+ ### Model Architecture
+ - **Type:** Cross-encoder (both questions processed together)
+ - **Advantage:** Higher accuracy than bi-encoder approaches
+ - **Trade-off:** Slower inference than bi-encoders
+
+ ### Probability Calibration
+ This model includes a calibration component that improves probability estimates:
+ - **Method:** Logistic Regression on validation predictions
+ - **Benefit:** More reliable confidence scores for production use
+ - **File:** `deberta_cal.pkl` (included in repository)
+
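+ For reference, a calibrator of this kind can be fit on held-out predictions roughly as follows. This is a sketch of the stated method (Logistic Regression on validation predictions), not the exact script that produced `deberta_cal.pkl`:
+
+ ```python
+ import joblib
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+
+ def fit_calibrator(raw_val_probs, val_labels, path="deberta_cal.pkl"):
+     """Fit a logistic-regression calibrator on validation-set scores.
+
+     raw_val_probs: 1-D array of raw model scores for the "duplicate" class
+     val_labels:    1-D array of 0/1 ground-truth labels
+     """
+     X = np.asarray(raw_val_probs).reshape(-1, 1)
+     y = np.asarray(val_labels)
+     calibrator = LogisticRegression()
+     calibrator.fit(X, y)
+     joblib.dump(calibrator, path)  # same file that the usage example loads
+     return calibrator
+ ```
+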
171
+ ## Limitations and Bias
172
+
173
+ **Limitations:**
174
+ - Optimized for English question pairs only
175
+ - Performance may degrade on very long questions (>128 tokens)
176
+ - Training data reflects Quora user demographics and question patterns
177
+
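+ If long inputs are a concern, the tokenizer from the usage examples can report a pair's token count before scoring; pairs over 128 tokens are truncated. A small illustrative check (not part of the released pipeline):
+
+ ```python
+ def pair_token_length(q1: str, q2: str) -> int:
+     # Length of the encoded pair, including [CLS]/[SEP] special tokens
+     return len(tokenizer(q1, q2)["input_ids"])
+
+ if pair_token_length(question1, question2) > 128:
+     print("Warning: pair exceeds 128 tokens and will be truncated")
+ ```
+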
+ **Bias Considerations:**
+ - Model inherits biases from DeBERTa base model and Quora dataset
+ - May perform differently across question domains/topics
+ - Evaluation primarily on question similarity, not general text
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{deberta-v3-quora-question-pairs,
+   title={DeBERTa-v3 for Quora Question Pairs Duplicate Detection},
+   author={Your Name},
+   year={2024},
+   url={https://huggingface.co/your-username/deberta-v3-quora-question-pairs}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - Microsoft Research for DeBERTa-v3-base
+ - Quora for the Question Pairs dataset
+ - Hugging Face for the transformers library
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "[MASK]": 128000
+ }
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "architectures": [
+     "DebertaV2ForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "legacy": true,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 0,
+   "vocab_size": 128100,
+   "id2label": {
+     "0": "not_duplicate",
+     "1": "duplicate"
+   },
+   "label2id": {
+     "not_duplicate": 0,
+     "duplicate": 1
+   },
+   "num_labels": 2,
+   "problem_type": "single_label_classification",
+   "finetuning_task": "question_pairs"
+ }
deberta_cal.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b303e694c316a1595aaeb52414b13c0dba15491443a8e0dbf1667d48bbb32f0
+ size 879
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47f4af56d91a0d4d8a74e77e94468f599017b01d6fdaeb7dbedb2af951995d26
+ size 737719272
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+ size 2464616
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "128000": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]",
+   "vocab_type": "spm"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3baee82c0fc3aad76bdb0a283d1042dff3c0e48cda93ef0b96be57830a046a5
+ size 5713