fatihburakkaragoz committed
Commit de8d860 · verified · 1 Parent(s): 3ddebe3

Initial model upload (model + tokenizer + config)
README.md ADDED
@@ -0,0 +1,200 @@
+ ---
+ license: mit
+ base_model: microsoft/deberta-v3-base
+ tags:
+ - text-classification
+ - question-answering
+ - semantic-similarity
+ - quora
+ - duplicate-detection
+ - transformers
+ - pytorch
+ datasets:
+ - quora
+ language:
+ - en
+ metrics:
+ - roc_auc
+ model-index:
+ - name: deberta-v3-quora-question-pairs
+   results:
+   - task:
+       type: text-classification
+       name: Question Pair Duplicate Detection
+     dataset:
+       name: Quora Question Pairs
+       type: quora
+     metrics:
+     - type: roc_auc
+       value: 0.9759
+       name: ROC AUC
+ ---
+
+ # DeBERTa-v3 for Quora Question Pairs Duplicate Detection
+
+ A fine-tuned DeBERTa-v3-base model for identifying duplicate question pairs, achieving 97.59% ROC AUC on the Quora Question Pairs dataset.
+
+ ## Model Description
+
+ This model is a fine-tuned version of [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) on the Quora Question Pairs dataset. It uses a cross-encoder architecture to determine whether two questions are semantically equivalent.
+
+ **Key Features:**
+ - Cross-encoder architecture for higher accuracy than bi-encoder approaches
+ - Probability calibration for reliable confidence estimates
+ - Robust handling of missing/empty questions
+ - Production-ready inference pipeline
+
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | ROC AUC | 97.59% |
+ | Training Loss | 0.116 |
+ | Validation Loss | 0.214 |
+
+ ## Intended Use
+
+ **Primary Use Cases:**
+ - Question deduplication systems
+ - Semantic similarity detection
+ - Content moderation for duplicate questions
+ - Search and retrieval systems
+
+ **Out-of-Scope Use:**
+ - General text similarity (the model is optimized for questions)
+ - Languages other than English
+ - Long texts (the model was trained with a maximum sequence length of 128 tokens)
+
+ ## Usage
+
+ ### Basic Inference
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "your-username/deberta-v3-quora-question-pairs"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example usage
+ question1 = "How do I learn Python programming?"
+ question2 = "What's the best way to learn Python?"
+
+ # Tokenize and predict
+ inputs = tokenizer(question1, question2,
+                    truncation=True, padding=True,
+                    max_length=128, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     probability = torch.softmax(logits, dim=-1)[0, 1].item()
+
+ print(f"Duplicate probability: {probability:.3f}")
+ ```
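+
+ For scoring many pairs at once, the tokenizer and model loaded above can be called on parallel lists of first and second questions. The following is a minimal batched sketch (the example pairs are illustrative):
+
+ ```python
+ # Batched scoring: pass parallel lists of first and second questions
+ firsts = ["How do I learn Python programming?", "How do I learn Python programming?"]
+ seconds = ["What's the best way to learn Python?", "How do I bake bread?"]
+
+ batch = tokenizer(firsts, seconds, truncation=True, padding=True,
+                   max_length=128, return_tensors="pt")
+
+ with torch.no_grad():
+     probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1]
+
+ for q1, q2, p in zip(firsts, seconds, probs.tolist()):
+     print(f"{p:.3f}  {q1!r} vs {q2!r}")
+ ```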
+
+ ### With Probability Calibration (Recommended)
+
+ For the most accurate confidence estimates, use the included calibrator:
+
+ ```python
+ import joblib
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model, tokenizer, and calibrator
+ model_name = "your-username/deberta-v3-quora-question-pairs"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Note: deberta_cal.pkl is stored in the model repository but is not loaded by
+ # from_pretrained, so download it from the repository files before running this.
+ calibrator = joblib.load("deberta_cal.pkl")
+
+ def predict_duplicate(question1, question2):
+     # Get the raw model score for the "duplicate" class
+     inputs = tokenizer(question1, question2, truncation=True,
+                        padding=True, max_length=128, return_tensors="pt")
+
+     with torch.no_grad():
+         logits = model(**inputs).logits
+         raw_prob = torch.sigmoid(logits[0, 1]).item()
+
+     # Apply calibration for better confidence estimates
+     calibrated_prob = calibrator.predict_proba([[raw_prob]])[0, 1]
+     return calibrated_prob
+
+ # Example
+ prob = predict_duplicate("How to cook pasta?", "What's the best pasta recipe?")
+ print(f"Calibrated duplicate probability: {prob:.3f}")
+ ```
+
+ ## Training Details
+
+ ### Training Data
+ - **Dataset:** Quora Question Pairs (~400K question pairs)
+ - **Split:** 90% training, 10% validation (stratified)
+ - **Preprocessing:** Missing values filled with empty strings
+
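+ The preprocessing and split described above can be reproduced roughly as follows. This is a sketch, not the exact training script; reading the Kaggle `train.csv` with pandas and the fixed `random_state` are assumptions.
+
+ ```python
+ import pandas as pd
+ from sklearn.model_selection import train_test_split
+
+ # Assumed input: Kaggle QQP "train.csv" with question1, question2, is_duplicate columns
+ df = pd.read_csv("train.csv")
+
+ # Missing questions become empty strings rather than NaN
+ df["question1"] = df["question1"].fillna("")
+ df["question2"] = df["question2"].fillna("")
+
+ # 90/10 split, stratified on the duplicate label
+ train_df, val_df = train_test_split(
+     df, test_size=0.1, stratify=df["is_duplicate"], random_state=42
+ )
+ ```
+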
+ ### Training Configuration
+ - **Base Model:** microsoft/deberta-v3-base
+ - **Architecture:** Cross-encoder with sequence classification head
+ - **Max Length:** 128 tokens
+ - **Batch Size:** 8 per device (with gradient accumulation)
+ - **Learning Rate:** 2e-5
+ - **Epochs:** 3
+ - **Optimizer:** AdamW
+ - **Precision:** FP16
+
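+ As a rough illustration, the configuration above maps onto `transformers.TrainingArguments` along these lines. Values not listed above (output directory, gradient accumulation steps, evaluation cadence) are assumptions:
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="deberta-v3-quora-question-pairs",  # assumed name
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,  # "with gradient accumulation" -- exact value not stated
+     learning_rate=2e-5,
+     fp16=True,
+     eval_strategy="epoch",          # assumed; matches the per-epoch results below
+     save_strategy="epoch",          # assumed
+ )
+ ```
+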
+ ### Training Results
+
+ | Epoch | Training Loss | Validation Loss | ROC AUC |
+ |-------|---------------|-----------------|---------|
+ | 1 | 0.219 | 0.211 | 0.972 |
+ | 2 | 0.171 | 0.198 | 0.976 |
+ | 3 | 0.116 | 0.214 | 0.976 |
+
+ ## Technical Details
+
+ ### Model Architecture
+ - **Type:** Cross-encoder (both questions processed together)
+ - **Advantage:** Higher accuracy than bi-encoder approaches
+ - **Trade-off:** Slower inference than bi-encoders
+
+ ### Probability Calibration
+ This model includes a calibration component that improves probability estimates:
+ - **Method:** Logistic Regression on validation predictions
+ - **Benefit:** More reliable confidence scores for production use
+ - **File:** `deberta_cal.pkl` (included in repository)
+
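+ For reference, a calibrator of this kind can be fit on held-out predictions roughly as follows. This is a sketch of the stated method (Logistic Regression on validation predictions), not the exact script that produced `deberta_cal.pkl`:
+
+ ```python
+ import joblib
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+
+ def fit_calibrator(raw_val_probs, val_labels, path="deberta_cal.pkl"):
+     """Fit a logistic-regression calibrator on validation-set scores.
+
+     raw_val_probs: 1-D array of raw model scores for the "duplicate" class
+     val_labels:    1-D array of 0/1 ground-truth labels
+     """
+     X = np.asarray(raw_val_probs).reshape(-1, 1)
+     y = np.asarray(val_labels)
+     calibrator = LogisticRegression()
+     calibrator.fit(X, y)
+     joblib.dump(calibrator, path)  # same file that the usage example loads
+     return calibrator
+ ```
+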
171
+ ## Limitations and Bias
172
+
173
+ **Limitations:**
174
+ - Optimized for English question pairs only
175
+ - Performance may degrade on very long questions (>128 tokens)
176
+ - Training data reflects Quora user demographics and question patterns
177
+
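+ If long inputs are a concern, the tokenizer from the usage examples can report a pair's token count before scoring; pairs over 128 tokens are truncated. A small illustrative check (not part of the released pipeline):
+
+ ```python
+ def pair_token_length(q1: str, q2: str) -> int:
+     # Length of the encoded pair, including [CLS]/[SEP] special tokens
+     return len(tokenizer(q1, q2)["input_ids"])
+
+ if pair_token_length(question1, question2) > 128:
+     print("Warning: pair exceeds 128 tokens and will be truncated")
+ ```
+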
+ **Bias Considerations:**
+ - Model inherits biases from DeBERTa base model and Quora dataset
+ - May perform differently across question domains/topics
+ - Evaluation primarily on question similarity, not general text
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{deberta-v3-quora-question-pairs,
+   title={DeBERTa-v3 for Quora Question Pairs Duplicate Detection},
+   author={Your Name},
+   year={2024},
+   url={https://huggingface.co/your-username/deberta-v3-quora-question-pairs}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - Microsoft Research for DeBERTa-v3-base
+ - Quora for the Question Pairs dataset
+ - Hugging Face for the transformers library
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "[MASK]": 128000
+ }
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "architectures": [
+     "DebertaV2ForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "legacy": true,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 0,
+   "vocab_size": 128100,
+   "id2label": {
+     "0": "not_duplicate",
+     "1": "duplicate"
+   },
+   "label2id": {
+     "not_duplicate": 0,
+     "duplicate": 1
+   },
+   "num_labels": 2,
+   "problem_type": "single_label_classification",
+   "finetuning_task": "question_pairs"
+ }
deberta_cal.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b303e694c316a1595aaeb52414b13c0dba15491443a8e0dbf1667d48bbb32f0
+ size 879
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47f4af56d91a0d4d8a74e77e94468f599017b01d6fdaeb7dbedb2af951995d26
+ size 737719272
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+ size 2464616
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "128000": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]",
+   "vocab_type": "spm"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3baee82c0fc3aad76bdb0a283d1042dff3c0e48cda93ef0b96be57830a046a5
+ size 5713