thebajajra committed
Commit 9ffa1e9 · verified · 1 Parent(s): dec3b0c

Update README.md

Files changed (1):
  1. README.md +212 -92
README.md CHANGED
@@ -1,114 +1,230 @@
  ---
- library_name: transformers
- tags:
- - gemma3
- - gemma3_text
- - encoder
- - bidirectional
- - masked-language-modeling
- - text-embeddings
- - feature-extraction
- - custom_code
  license: mit
- base_model: thebajajra/Gemma3-270M-encoder
  pipeline_tag: fill-mask
  ---

- # gemma3-encoder-270m-mlm-euro

- A **bidirectional encoder** fine-tuned from [thebajajra/Gemma3-270M-encoder](https://huggingface.co/thebajajra/Gemma3-270M-encoder) with Masked Language Modeling (MLM).

- ## Model Description

- This model is a BERT-style bidirectional encoder based on Gemma 3 architecture:
- - Bidirectional attention (not causal)
- - Masked language modeling head (tied to input embeddings)
- - Trained with 15% token masking (BERT-style MLM)

- ### Architecture Details

- | Parameter | Value |
- |-----------|-------|
- | Base Model | [`thebajajra/Gemma3-270M-encoder`](https://huggingface.co/thebajajra/Gemma3-270M-encoder) |
- | Vocab Size | 262,145 |
- | Sliding Window | 512 |
- | Max Sequence Length | 2048 |
- | Attention | Bidirectional |

- ## Usage

- > **Note**: This model uses custom code, so you **must** include `trust_remote_code=True` when loading. This is a security feature that allows loading custom model classes.

- ### Loading the Model

  ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
  ```

- ### Masked Language Modeling

  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

- model = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")

- fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
  fill("Best [MASK] headphones under $100.")
  ```

- ### Embeddings / Feature Extraction
-
  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModel.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)

- texts = ["wireless mouse", "ergonomic mouse pad"]
- inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

  with torch.no_grad():
-     outputs = model(**inputs)
-     # Mean-pool last hidden state
-     attn = inputs["attention_mask"].unsqueeze(-1)
-     embeddings = (outputs.last_hidden_state * attn).sum(1) / attn.sum(1)
-     # Normalize for cosine similarity
-     embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
  ```

- ### Sentence-Transformers
-
- ```python
- from transformers import AutoModelForMaskedLM, AutoTokenizer
- from sentence_transformers import SentenceTransformer
-
- model_mlm = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
- encoder = model_mlm.encoder
- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
-
- ENCODER_DIR = "encoder-only"
- encoder.save_pretrained(ENCODER_DIR)
- tokenizer.save_pretrained(ENCODER_DIR)
-
- model = SentenceTransformer(ENCODER_DIR)
- ```
-
- ### Text Classification Fine-tuning
-
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForSequenceClassification.from_pretrained(
-     "gemma3-encoder-270m-mlm-euro",
-     num_labels=NUM_LABELS,
-     trust_remote_code=True
- )

  # Prepare your Dataset objects: train_ds, val_ds (text→label)
  args = TrainingArguments(
@@ -122,34 +238,38 @@ args = TrainingArguments(
      load_best_model_at_end=True,
  )

- trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tokenizer)
  trainer.train()
  ```

- ### Token Classification (NER/POS Tagging)

- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForTokenClassification.from_pretrained(
-     "gemma3-encoder-270m-mlm-euro",
-     num_labels=NUM_LABELS,  # e.g., number of NER tags
-     trust_remote_code=True
- )
- ```

- ## Training

- This model was trained using MLM on packed sequences with:
- - Dynamic BERT-style token masking (15%)
- - AdamW optimizer with fused kernels
- - Mixed precision training

  ## License

- MIT License

- ## Citation

- If you use this model, please cite this repository.
  ---
  license: mit
+ language:
+ - en
+ - ru
+ - pt
+ - de
+ - it
+ - nl
+ - es
+ - fr
+ - uk
+ - pl
  pipeline_tag: fill-mask
+ library_name: transformers
+ tags:
+ - ecommerce
+ - e-commerce
+ - retail
+ - marketplace
+ - shopping
+ - amazon
+ - ebay
+ - alibaba
+ - google
+ - rakuten
+ - bestbuy
+ - walmart
+ - flipkart
+ - wayfair
+ - shein
+ - target
+ - etsy
+ - shopify
+ - taobao
+ - asos
+ - carrefour
+ - costco
+ - overstock
+ - pretraining
+ - encoder
+ - language-modeling
+ - foundation-model
+ datasets:
+ - thebajajra/Ecomniverse-euro
  ---
+ # RexGemma-Euro
+
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://mit-license.org)
+ [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-red)](https://huggingface.co/collections/thebajajra/rexgemma)
+ [![Data](https://img.shields.io/badge/🤗%20Training%20Data-Ecomniverse-yellow)](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/bajajra/RexGemma)
+
+ > **TL;DR**: A Gemma3-270M decoder converted into an encoder with a 2048-token sequence length and 100M non-embedding parameters, built to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 350B+ e-commerce-specific tokens.
+
+ ---
+
+ ## Table of Contents
+ - [Quick Start](#quick-start)
+ - [Intended Uses & Limitations](#intended-uses--limitations)
+ - [Model Description](#model-description)
+ - [Training Recipe](#training-recipe)
+ - [Data Overview](#data-overview)
+ - [Evaluation](#evaluation)
+ - [Usage Examples](#usage-examples)
+   - [Masked language modeling](#1-masked-language-modeling)
+   - [Embeddings / feature extraction](#2-embeddings--feature-extraction)
+   - [Text classification fine-tune](#3-text-classification-fine-tune)
+   - [Token classification](#4-token-classification-attribute-extraction)
+ - [Model Architecture & Compatibility](#model-architecture--compatibility)
+ - [Responsible & Safe Use](#responsible--safe-use)
+ - [License](#license)
+ - [Maintainers & Contact](#maintainers--contact)
+
+ ---
+ ## Quick Start
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline
+
+ MODEL_ID = "thebajajra/RexGemma-Euro"
+
+ # Tokenizer
+ tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
+
+ # 1) Fill-Mask (if MLM head is present)
+ mlm = pipeline("fill-mask", model=MODEL_ID, tokenizer=tok)
+ print(mlm("These running shoes are great for [MASK] training."))
+
+ # 2) Feature extraction (CLS or mean-pooled embeddings)
+ enc = AutoModel.from_pretrained(MODEL_ID)
+ inputs = tok(["wireless mouse", "ergonomic mouse pad"], padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     out = enc(**inputs, output_hidden_states=True)
+ # Mean-pool last hidden state for sentence embeddings
+ emb = (out.last_hidden_state * inputs.attention_mask.unsqueeze(-1)).sum(dim=1) / inputs.attention_mask.sum(dim=1, keepdim=True)
+ ```
+
+ ### Sentence-Transformers
+
+ A minimal sketch, assuming the checkpoint can be loaded directly by `sentence-transformers` (when a repo ships no sentence-transformers config, the library wraps the Transformers encoder with its default mean pooling):
+
  ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Assumption: no dedicated sentence-transformers config, so the raw encoder
+ # is wrapped with the library's default mean pooling.
+ model = SentenceTransformer("thebajajra/RexGemma-Euro")
+ emb = model.encode(["wireless mouse", "ergonomic mouse pad"], normalize_embeddings=True)
+ print(emb.shape)
  ```
+ ---
+
+ ## Intended Uses & Limitations
+
+ **Use cases**
+ - Product & query **retrieval/semantic search** (titles, descriptions, attributes)
+ - **Attribute extraction** / slot filling (brand, color, size, material)
+ - **Classification** (category assignment, unsafe/regulated item filtering, review sentiment)
+ - **Reranking** and **query understanding** (spelling/ASR normalization, acronym expansion)
+
+ **Out of scope**
+ - Long-form **generation** (use a decoder/seq-to-seq LM instead)
+ - High-stakes decisions without human review (pricing, compliance, safety flags)
+
+ **Target users**
+ - Search/recs engineers, e-commerce data teams, ML researchers working on domain-specific encoders
+
+ ---
+
+ ## Model Description
+
+ RexGemma-Euro is an **encoder-only** transformer with 100M non-embedding parameters, trained with a masked-language-modeling objective and optimized for **e-commerce-related text**.
+
+ ---
+
+ ## Training Recipe
+
+ - **Objective:** masked language modeling over e-commerce text (350B+ tokens from [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)).
+ - **Context:** sequence length extended to 2048 tokens during a dedicated context-extension phase.
+
+ ---
+
+ ## Data Overview
+
+ - **Dataset:** [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
+ - **Domain mix:**
+
+ We identified 9 overlapping domains that carry a significant amount of e-commerce-relevant tokens but required filtering. The domains and their filtered sizes are listed below.
+
+ | Domain | Size (GB) |
+ |---|---|
+ | Hobby | 114 |
+ | News | 66 |
+ | Health | 66 |
+ | Entertainment | 64 |
+ | Travel | 52 |
+ | Food | 22 |
+ | Automotive | 19 |
+ | Sports | 12 |
+ | Music and Dance | 7 |
+
+ Additionally, 6 more domains had almost complete overlap and were taken directly from FineFineWeb.
+
+ | Domain | Size (GB) |
+ |---|---|
+ | Fashion | 37 |
+ | Beauty | 37 |
+ | Celebrity | 28 |
+ | Movie | 26 |
+ | Photo | 15 |
+ | Painting | 2 |
+
+ By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. A more fine-grained filtering pass within each domain is therefore required to extract only the e-commerce-specific lines. We accomplish this by training lightweight classifiers per domain to distinguish e-commerce from non-e-commerce content, as sketched below.
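+
+ The per-domain filters themselves are not released as part of this card; the following is only an illustrative sketch of the idea (a hypothetical TF-IDF + logistic-regression filter over a few labeled lines), not the actual Ecom-niverse pipeline:
+
+ ```python
+ # Illustrative only: a toy per-domain e-commerce filter, not the production classifiers.
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.pipeline import make_pipeline
+
+ # Tiny labeled sample for one domain (1 = e-commerce context, 0 = other)
+ lines = [
+     "nike pegasus 40 running shoes, size 10, $129.99",
+     "add to cart for free shipping on orders over $35",
+     "the first modern marathon was run in 1896",
+     "local running club announces weekend training plan",
+ ]
+ labels = [1, 1, 0, 0]
+
+ clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
+ clf.fit(lines, labels)
+
+ # Keep only candidate lines the filter scores as e-commerce context
+ candidates = ["best wireless earbuds under $50", "history of the olympic games"]
+ kept = [s for s, p in zip(candidates, clf.predict(candidates)) if p == 1]
+ print(kept)
+ ```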
+
+ ---
+ ## Evaluation
+
+ <!-- ### Token Classification
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/DuUWO7SyzxJsN53dOSV60.png)
+
+ > With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.
+ -->
+ ### Semantic Similarity
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/YQPPRjz-BtIH_MgayQ7ZV.png)
+
+ \*Non-embedding parameter counts are used to plot RexGemma-2048.
+
+ > RexGemma models outperform all models in their parameter/size category, including the RexBERT family of models.
+
+ ---
+
+ ## Usage Examples
+
+ ### 1) Masked language modeling
  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

+ m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexGemma-Euro")
+ t = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ fill = pipeline("fill-mask", model=m, tokenizer=t)

  fill("Best [MASK] headphones under $100.")
  ```
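+
+ If the tokenizer's mask token is not the literal `[MASK]` string, build the prompt from the tokenizer attribute instead (a small robustness tweak, not part of the original example):
+
+ ```python
+ # Use the tokenizer's own mask token rather than a hard-coded [MASK] string.
+ print(fill(f"Best {t.mask_token} headphones under $100."))
+ ```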
+
+ ### 2) Embeddings / feature extraction
  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ enc = AutoModel.from_pretrained("thebajajra/RexGemma-Euro")

+ texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
+ batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

  with torch.no_grad():
+     out = enc(**batch)
+ # Mean-pool last hidden state
+ attn = batch["attention_mask"].unsqueeze(-1)
+ emb = (out.last_hidden_state * attn).sum(1) / attn.sum(1)
+ # Normalize for cosine similarity (recommended for retrieval)
+ emb = torch.nn.functional.normalize(emb, p=2, dim=1)
  ```
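+
+ With L2-normalized embeddings, cosine similarity reduces to a dot product; continuing from the snippet above:
+
+ ```python
+ # Pairwise cosine similarity between the two product strings above.
+ scores = emb @ emb.T
+ print(scores[0, 1].item())
+ ```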
+
+ ### 3) Text classification fine-tune
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexGemma-Euro", num_labels=NUM_LABELS)

  # Prepare your Dataset objects: train_ds, val_ds (text→label)
  args = TrainingArguments(

      load_best_model_at_end=True,
  )

+ trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tok)
  trainer.train()
  ```
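+
+ ### 4) Token classification (attribute extraction)
+
+ The attribute-extraction / slot-filling use case above maps to token classification. A minimal sketch following the same head-swap pattern as the other examples (`NUM_LABELS` is a placeholder for your tag set):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ model = AutoModelForTokenClassification.from_pretrained(
+     "thebajajra/RexGemma-Euro",
+     num_labels=NUM_LABELS,  # e.g., number of BIO tags for brand/color/size/material
+ )
+ # Fine-tune with Trainer on token-labeled data, mirroring the text-classification example.
+ ```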
+
+ ---
+
+ ## Model Architecture & Compatibility
+
+ - **Architecture:** Encoder-only, with a Gemma3-270M backbone.
+ - **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.
+ - **Context length:** Increased during the **Context Extension** phase; ensure `max_position_embeddings` in `config.json` matches your desired max length.
+ - **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.
+ - **Export:** Standard PyTorch weights; you can export ONNX / TorchScript for production if needed.
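+
+ A quick sanity check of the configured context window before deployment (assuming the config loads directly, as in the examples above):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Confirm the shipped config matches the 2048-token context this card describes.
+ cfg = AutoConfig.from_pretrained("thebajajra/RexGemma-Euro")
+ print(cfg.max_position_embeddings)
+ ```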
+
+ ---
+
+ ## Responsible & Safe Use
+
+ - **Biases:** Commerce data can encode brand, price, and region biases; audit downstream classifiers/retrievers for disparate error rates across categories/regions.
+ - **Sensitive content:** Add filters for adult/regulated items; document moderation thresholds if you release classifiers.
+ - **Privacy:** Do not expose PII; ensure training data complies with terms and applicable laws.
+ - **Misuse:** This model is **not** a substitute for legal/compliance review for listings.
+
+ ---

  ## License

+ - **License:** `MIT`.
+
+ ---
+
+ ## Maintainers & Contact
+
+ - **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra)
+
+ ---