thebajajra committed
Commit 9ffa1e9 · verified · 1 Parent(s): dec3b0c

Update README.md

Files changed (1):
  1. README.md +212 -92
README.md CHANGED
@@ -1,114 +1,230 @@
  ---
- library_name: transformers
- tags:
- - gemma3
- - gemma3_text
- - encoder
- - bidirectional
- - masked-language-modeling
- - text-embeddings
- - feature-extraction
- - custom_code
  license: mit
- base_model: thebajajra/Gemma3-270M-encoder
  pipeline_tag: fill-mask
  ---

- # gemma3-encoder-270m-mlm-euro

- A **bidirectional encoder** fine-tuned from [thebajajra/Gemma3-270M-encoder](https://huggingface.co/thebajajra/Gemma3-270M-encoder) with Masked Language Modeling (MLM).

- ## Model Description

- This model is a BERT-style bidirectional encoder based on Gemma 3 architecture:
- - Bidirectional attention (not causal)
- - Masked language modeling head (tied to input embeddings)
- - Trained with 15% token masking (BERT-style MLM)

- ### Architecture Details

- | Parameter | Value |
- |-----------|-------|
- | Base Model | [`thebajajra/Gemma3-270M-encoder`](https://huggingface.co/thebajajra/Gemma3-270M-encoder) |
- | Vocab Size | 262,145 |
- | Sliding Window | 512 |
- | Max Sequence Length | 2048 |
- | Attention | Bidirectional |

- ## Usage

- > **Note**: This model uses custom code, so you **must** include `trust_remote_code=True` when loading. This is a security feature that allows loading custom model classes.

- ### Loading the Model

  ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
  ```

- ### Masked Language Modeling

  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

- model = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")

- fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
  fill("Best [MASK] headphones under $100.")
  ```

- ### Embeddings / Feature Extraction
-
  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModel.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)

- texts = ["wireless mouse", "ergonomic mouse pad"]
- inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

  with torch.no_grad():
-     outputs = model(**inputs)
-     # Mean-pool last hidden state
-     attn = inputs["attention_mask"].unsqueeze(-1)
-     embeddings = (outputs.last_hidden_state * attn).sum(1) / attn.sum(1)
-     # Normalize for cosine similarity
-     embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
  ```

- ### Sentence-Transformers
-
- ```python
- from transformers import AutoModelForMaskedLM, AutoTokenizer
- from sentence_transformers import SentenceTransformer
-
- model_mlm = AutoModelForMaskedLM.from_pretrained("gemma3-encoder-270m-mlm-euro", trust_remote_code=True)
- encoder = model_mlm.encoder
- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
-
- ENCODER_DIR = "encoder-only"
- encoder.save_pretrained(ENCODER_DIR)
- tokenizer.save_pretrained(ENCODER_DIR)
-
- model = SentenceTransformer(ENCODER_DIR)
- ```
-
- ### Text Classification Fine-tuning
-
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForSequenceClassification.from_pretrained(
-     "gemma3-encoder-270m-mlm-euro",
-     num_labels=NUM_LABELS,
-     trust_remote_code=True
- )

  # Prepare your Dataset objects: train_ds, val_ds (text→label)
  args = TrainingArguments(
@@ -122,34 +238,38 @@ args = TrainingArguments(
      load_best_model_at_end=True,
  )

- trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tokenizer)
  trainer.train()
  ```

- ### Token Classification (NER/POS Tagging)

- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification

- tokenizer = AutoTokenizer.from_pretrained("gemma3-encoder-270m-mlm-euro")
- model = AutoModelForTokenClassification.from_pretrained(
-     "gemma3-encoder-270m-mlm-euro",
-     num_labels=NUM_LABELS,  # e.g., number of NER tags
-     trust_remote_code=True
- )
- ```

- ## Training

- This model was trained using MLM on packed sequences with:
- - Dynamic BERT-style token masking (15%)
- - AdamW optimizer with fused kernels
- - Mixed precision training

  ## License

- MIT License

- ## Citation

- If you use this model, please cite this repository.
  ---
  license: mit
+ language:
+ - en
+ - ru
+ - pt
+ - de
+ - it
+ - nl
+ - es
+ - fr
+ - uk
+ - pl
  pipeline_tag: fill-mask
+ library_name: transformers
+ tags:
+ - ecommerce
+ - e-commerce
+ - retail
+ - marketplace
+ - shopping
+ - amazon
+ - ebay
+ - alibaba
+ - google
+ - rakuten
+ - bestbuy
+ - walmart
+ - flipkart
+ - wayfair
+ - shein
+ - target
+ - etsy
+ - shopify
+ - taobao
+ - asos
+ - carrefour
+ - costco
+ - overstock
+ - pretraining
+ - encoder
+ - language-modeling
+ - foundation-model
+ datasets:
+ - thebajajra/Ecomniverse-euro
  ---
+ # RexGemma-Euro
+
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://mit-license.org)
+ [![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-red)](https://huggingface.co/collections/thebajajra/rexgemma)
+ [![Data](https://img.shields.io/badge/🤗%20Training%20Data-Ecomniverse-yellow)](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/bajajra/RexGemma)
+
+ > **TL;DR**: A Gemma3-270M decoder converted into an encoder with a 2048-token sequence length and 100M non-embedding parameters, built to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 350B+ e-commerce-specific tokens.
+
+ ---
+
+ ## Table of Contents
+ - [Quick Start](#quick-start)
+ - [Intended Uses & Limitations](#intended-uses--limitations)
+ - [Model Description](#model-description)
+ - [Training Recipe](#training-recipe)
+ - [Data Overview](#data-overview)
+ - [Evaluation](#evaluation)
+ - [Usage Examples](#usage-examples)
+   - [Masked language modeling](#1-masked-language-modeling)
+   - [Embeddings / feature extraction](#2-embeddings--feature-extraction)
+   - [Text classification fine-tune](#3-text-classification-fine-tune)
+   - [Token classification](#4-token-classification-attribute-extraction)
+ - [Model Architecture & Compatibility](#model-architecture--compatibility)
+ - [Responsible & Safe Use](#responsible--safe-use)
+ - [License](#license)
+ - [Maintainers & Contact](#maintainers--contact)
+
+ ---
+ ## Quick Start
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline
+
+ MODEL_ID = "thebajajra/RexGemma-Euro"
+
+ # Tokenizer
+ tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
+
+ # 1) Fill-Mask (if MLM head is present)
+ mlm = pipeline("fill-mask", model=MODEL_ID, tokenizer=tok)
+ print(mlm("These running shoes are great for [MASK] training."))
+
+ # 2) Feature extraction (CLS or mean-pooled embeddings)
+ enc = AutoModel.from_pretrained(MODEL_ID)
+ inputs = tok(["wireless mouse", "ergonomic mouse pad"], padding=True, truncation=True, return_tensors="pt")
+ with torch.no_grad():
+     out = enc(**inputs, output_hidden_states=True)
+ # Mean-pool last hidden state for sentence embeddings
+ emb = (out.last_hidden_state * inputs.attention_mask.unsqueeze(-1)).sum(dim=1) / inputs.attention_mask.sum(dim=1, keepdim=True)
+ ```
+
+ ### Sentence-Transformers
+
+ A minimal sketch, assuming the checkpoint can be loaded directly by `sentence-transformers` (when a repo ships no sentence-transformers config, the library wraps the Transformers encoder with its default mean pooling):
+
  ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Assumption: no dedicated sentence-transformers config, so the raw encoder
+ # is wrapped with the library's default mean pooling.
+ model = SentenceTransformer("thebajajra/RexGemma-Euro")
+ emb = model.encode(["wireless mouse", "ergonomic mouse pad"], normalize_embeddings=True)
+ print(emb.shape)
  ```
+ ---
+
+ ## Intended Uses & Limitations
+
+ **Use cases**
+ - Product & query **retrieval/semantic search** (titles, descriptions, attributes)
+ - **Attribute extraction** / slot filling (brand, color, size, material)
+ - **Classification** (category assignment, unsafe/regulated item filtering, review sentiment)
+ - **Reranking** and **query understanding** (spelling/ASR normalization, acronym expansion)
+
+ **Out of scope**
+ - Long-form **generation** (use a decoder/seq-to-seq LM instead)
+ - High-stakes decisions without human review (pricing, compliance, safety flags)
+
+ **Target users**
+ - Search/recs engineers, e-commerce data teams, ML researchers working on domain-specific encoders
+
+ ---
+
+ ## Model Description
+
+ RexGemma-Euro is an **encoder-only** transformer with 100M non-embedding parameters, trained with a masked-language-modeling objective and optimized for **e-commerce-related text**.
+
+ ---
+
+ ## Training Recipe
+
+ - **Objective:** masked language modeling over e-commerce text (350B+ tokens from [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)).
+ - **Context:** sequence length extended to 2048 tokens during a dedicated context-extension phase.
+
+ ---
+
+ ## Data Overview
+
+ - **Dataset:** [Ecom-niverse](https://huggingface.co/datasets/thebajajra/Ecom-niverse)
+ - **Domain mix:**
+
+ We identified 9 overlapping domains that carry a significant amount of e-commerce-relevant tokens but required filtering. The domains and their filtered sizes are listed below.
+
+ | Domain | Size (GB) |
+ |---|---|
+ | Hobby | 114 |
+ | News | 66 |
+ | Health | 66 |
+ | Entertainment | 64 |
+ | Travel | 52 |
+ | Food | 22 |
+ | Automotive | 19 |
+ | Sports | 12 |
+ | Music and Dance | 7 |
+
+ Additionally, 6 more domains had almost complete overlap and were taken directly from FineFineWeb.
+
+ | Domain | Size (GB) |
+ |---|---|
+ | Fashion | 37 |
+ | Beauty | 37 |
+ | Celebrity | 28 |
+ | Movie | 26 |
+ | Photo | 15 |
+ | Painting | 2 |
+
+ By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. A more fine-grained filtering pass within each domain is therefore required to extract only the e-commerce-specific lines. We accomplish this by training lightweight classifiers per domain to distinguish e-commerce from non-e-commerce content, as sketched below.
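+
+ The per-domain filters themselves are not released as part of this card; the following is only an illustrative sketch of the idea (a hypothetical TF-IDF + logistic-regression filter over a few labeled lines), not the actual Ecom-niverse pipeline:
+
+ ```python
+ # Illustrative only: a toy per-domain e-commerce filter, not the production classifiers.
+ from sklearn.feature_extraction.text import TfidfVectorizer
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.pipeline import make_pipeline
+
+ # Tiny labeled sample for one domain (1 = e-commerce context, 0 = other)
+ lines = [
+     "nike pegasus 40 running shoes, size 10, $129.99",
+     "add to cart for free shipping on orders over $35",
+     "the first modern marathon was run in 1896",
+     "local running club announces weekend training plan",
+ ]
+ labels = [1, 1, 0, 0]
+
+ clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
+ clf.fit(lines, labels)
+
+ # Keep only candidate lines the filter scores as e-commerce context
+ candidates = ["best wireless earbuds under $50", "history of the olympic games"]
+ kept = [s for s, p in zip(candidates, clf.predict(candidates)) if p == 1]
+ print(kept)
+ ```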
+
+ ---
+ ## Evaluation
+
+ <!-- ### Token Classification
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/DuUWO7SyzxJsN53dOSV60.png)
+
+ > With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.
+ -->
+ ### Semantic Similarity
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6893dd21467f7d2f5f358a95/YQPPRjz-BtIH_MgayQ7ZV.png)
+
+ \*Non-embedding parameter counts are used to plot RexGemma-2048.
+
+ > RexGemma models outperform all models in their parameter/size category, including the RexBERT family of models.
+
+ ---
+
+ ## Usage Examples
+
+ ### 1) Masked language modeling
  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

+ m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexGemma-Euro")
+ t = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ fill = pipeline("fill-mask", model=m, tokenizer=t)

  fill("Best [MASK] headphones under $100.")
  ```
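+
+ If the tokenizer's mask token is not the literal `[MASK]` string, build the prompt from the tokenizer attribute instead (a small robustness tweak, not part of the original example):
+
+ ```python
+ # Use the tokenizer's own mask token rather than a hard-coded [MASK] string.
+ print(fill(f"Best {t.mask_token} headphones under $100."))
+ ```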
+
+ ### 2) Embeddings / feature extraction
  ```python
  import torch
  from transformers import AutoTokenizer, AutoModel

+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ enc = AutoModel.from_pretrained("thebajajra/RexGemma-Euro")

+ texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
+ batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

  with torch.no_grad():
+     out = enc(**batch)
+ # Mean-pool last hidden state
+ attn = batch["attention_mask"].unsqueeze(-1)
+ emb = (out.last_hidden_state * attn).sum(1) / attn.sum(1)
+ # Normalize for cosine similarity (recommended for retrieval)
+ emb = torch.nn.functional.normalize(emb, p=2, dim=1)
  ```
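+
+ With L2-normalized embeddings, cosine similarity reduces to a dot product; continuing from the snippet above:
+
+ ```python
+ # Pairwise cosine similarity between the two product strings above.
+ scores = emb @ emb.T
+ print(scores[0, 1].item())
+ ```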
+
+ ### 3) Text classification fine-tune
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexGemma-Euro", num_labels=NUM_LABELS)

  # Prepare your Dataset objects: train_ds, val_ds (text→label)
  args = TrainingArguments(

      load_best_model_at_end=True,
  )

+ trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tok)
  trainer.train()
  ```
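+
+ ### 4) Token classification (attribute extraction)
+
+ The attribute-extraction / slot-filling use case above maps to token classification. A minimal sketch following the same head-swap pattern as the other examples (`NUM_LABELS` is a placeholder for your tag set):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-Euro")
+ model = AutoModelForTokenClassification.from_pretrained(
+     "thebajajra/RexGemma-Euro",
+     num_labels=NUM_LABELS,  # e.g., number of BIO tags for brand/color/size/material
+ )
+ # Fine-tune with Trainer on token-labeled data, mirroring the text-classification example.
+ ```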
+
+ ---
+
+ ## Model Architecture & Compatibility
+
+ - **Architecture:** Encoder-only, with a Gemma3-270M backbone.
+ - **Libraries:** Works with **🤗 Transformers**; supports **fill-mask** and **feature-extraction** pipelines.
+ - **Context length:** Increased during the **Context Extension** phase; ensure `max_position_embeddings` in `config.json` matches your desired max length.
+ - **Files:** `config.json`, tokenizer files, and (optionally) heads for MLM or classification.
+ - **Export:** Standard PyTorch weights; you can export ONNX / TorchScript for production if needed.
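+
+ A quick sanity check of the configured context window before deployment (assuming the config loads directly, as in the examples above):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Confirm the shipped config matches the 2048-token context this card describes.
+ cfg = AutoConfig.from_pretrained("thebajajra/RexGemma-Euro")
+ print(cfg.max_position_embeddings)
+ ```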
+
+ ---
+
+ ## Responsible & Safe Use
+
+ - **Biases:** Commerce data can encode brand, price, and region biases; audit downstream classifiers/retrievers for disparate error rates across categories/regions.
+ - **Sensitive content:** Add filters for adult/regulated items; document moderation thresholds if you release classifiers.
+ - **Privacy:** Do not expose PII; ensure training data complies with terms and applicable laws.
+ - **Misuse:** This model is **not** a substitute for legal/compliance review for listings.
+
+ ---

  ## License

+ - **License:** `MIT`.
+
+ ---
+
+ ## Maintainers & Contact
+
+ - **Authors:** [Rahul Bajaj](https://huggingface.co/thebajajra)
+
+ ---