---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---
# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
## Overview
**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as **QLoRA 4-bit quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4 percentage-point accuracy** improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.
---
## Key Contributions
- Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
- Fine-tuned **Phi-mini-MoE** (7.6B params, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
- Achieved **+3–4 percentage-point accuracy improvements** on major Indic benchmarks:
  - **ARC-Challenge-Indic** (reasoning tasks)
  - **MMLU-Indic** (knowledge & domain understanding)
- Improved **generalization across multiple Indic languages** including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.
---
## Model Architecture
- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
- **Parameters:** 7.6B total (2.4B active per token)
- **Layers:** 32 decoder-only transformer blocks
- **Attention:** Grouped Query Attention (GQA)
- **Experts per layer:** 16 (Top-2 active per token)
- **Context length:** 4096 tokens
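
The figures above can be sanity-checked against the published configuration. The snippet below is a minimal sketch; the attribute names follow PhiMoE-style configs in `transformers` and are assumptions that may differ for this checkpoint's config class.

```python
# Sketch: reading the architecture numbers from the model config.
# Attribute names (num_local_experts, num_experts_per_tok) are assumptions based
# on PhiMoE-style configs in transformers; verify against the actual config class.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("SandLogicTechnologies/IndicPhi-mini")
print(config.num_hidden_layers)                       # expected: 32 decoder blocks
print(getattr(config, "num_local_experts", None))     # expected: 16 experts per MoE layer
print(getattr(config, "num_experts_per_tok", None))   # expected: 2 (Top-2 routing)
print(config.max_position_embeddings)                 # expected: 4096-token context
```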
---
## Usage
To load the fine-tuned model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit inference (requires bitsandbytes)
)

# "What are the problems of online education in rural areas?" (Hindi)
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
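
Because the model is tuned on multi-turn conversational data, prompts can also be built with the tokenizer's chat template. The snippet below continues from the code above and is a minimal sketch; it assumes the checkpoint ships a chat template usable with `apply_chat_template`.

```python
# Sketch: multi-turn prompting via the tokenizer's chat template (continues from above).
# Assumes the checkpoint ships a chat template compatible with apply_chat_template.
messages = [
    # "What are the problems of online education in rural areas?" (Hindi)
    {"role": "user", "content": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```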
## Dataset Preparation
### Data Sources
- **Total collected:** 561M samples from **53 Hugging Face datasets**.
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
- **Categories:** general text, translation, instruction, and conversational data.
### Processing Pipeline
1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples.
2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering.
3. **Format Conversion** – standardized into the **UltraChat JSON schema** (multi-turn conversations); a sketch follows below.
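
As a rough illustration of step 3, each instruction/response pair is rewritten into a multi-turn, messages-style record. This is only a sketch: the exact field names of the UltraChat-style schema used for the curated dataset are assumptions here.

```python
# Sketch of step 3 (format conversion): wrap an instruction/response pair into a
# multi-turn, messages-style record. Field names ("messages", "role", "content")
# and the "language" metadata field are illustrative assumptions; the actual
# schema of the curated dataset may differ.
import json

def to_chat_record(instruction: str, response: str, language: str) -> dict:
    return {
        "language": language,  # hypothetical metadata field
        "messages": [
            {"role": "user", "content": instruction.strip()},
            {"role": "assistant", "content": response.strip()},
        ],
    }

record = to_chat_record(
    "Translate 'Good morning' into Hindi.",  # placeholder instruction
    "सुप्रभात",                               # placeholder response
    "hi",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```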
### Final Cleaned Dataset
- **Size:** 29M samples
### Dataset Distribution (Final Cleaned)
| Language | Samples |
|------------|-----------|
| Hindi | 4.63M |
| Kannada | 3.54M |
| Telugu | 3.72M |
| Tamil | 3.86M |
| Marathi | 3.79M |
| Malayalam | 2.81M |
| Gujarati | 2.94M |
| Bengali | 1.82M |
| Odia | 438K |
| Punjabi | 1.21M |
| Assamese | 185K |
| Sinhala | 64K |
| Urdu | 58K |
**Total curated dataset:** ~29 million high-quality samples
---
## Training Details
- **Hardware:** 1 × NVIDIA A100-80GB
- **Precision:** QLoRA (4-bit quantization)
- **Batching:** Effective batch size 256 (32 × 8 gradient accumulation)
- **Steps:** 8,500
- **Optimizer:** AdamW (8-bit) + cosine LR schedule + 1k warmup steps
- **LoRA configuration** (a configuration sketch follows this list):
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - r=128, α=128, dropout=0
- **Final training loss:** 0.48
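
The setup above maps roughly onto `peft` and `bitsandbytes` as sketched below. This is not the exact training script; in particular, the NF4 quantization type and bfloat16 compute dtype are assumptions.

```python
# Sketch of the QLoRA + LoRA setup described above (not the exact training script).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumption: NF4 is the usual QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-mini-MoE-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only LoRA adapter weights are trainable
```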
---
## Evaluation & Results
### Benchmarks
1. **ARC-Challenge-Indic** (reasoning)
2. **MMLU-Indic** (knowledge & domain understanding)
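
Both benchmarks are multiple-choice. Accuracy for such tasks is typically computed by scoring every answer option with the model's log-likelihood and picking the highest-scoring one; "normalized accuracy" divides each score by the option's token length. The sketch below illustrates that scoring idea; it is not the exact evaluation harness used to produce the numbers reported here.

```python
# Illustrative multiple-choice scoring: pick the option whose tokens receive the
# highest (optionally length-normalized) log-likelihood under the model.
# This mirrors how ARC/MMLU-style accuracy is usually reported; it is not the
# exact harness behind the results below.
import torch
import torch.nn.functional as F

def choice_loglikelihood(model, tokenizer, question: str, choice: str):
    # Assumes the question tokenization is a prefix of question + " " + choice,
    # which holds approximately for most tokenizers.
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)   # predictions for tokens 1..L-1
    targets = full_ids[0, 1:]
    score = logprobs[prompt_len - 1:].gather(1, targets[prompt_len - 1:, None]).sum().item()
    n_choice_tokens = targets.shape[0] - (prompt_len - 1)
    return score, n_choice_tokens

def predict(model, tokenizer, question: str, choices: list, normalize: bool = False) -> int:
    scores = []
    for c in choices:
        lp, n = choice_loglikelihood(model, tokenizer, question, c)
        scores.append(lp / n if normalize else lp)
    return max(range(len(scores)), key=scores.__getitem__)
```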
### Improvements
- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 points)**
  - Normalized Accuracy: **24.69 → 28.86 (+4.17 points)**
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 points)**
### Results
#### ARC-Challenge-Indic
| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|--------------------------|
| Hindi | 22.61 | 26.17 |
| Kannada | 20.96 | 25.83 |
| Tamil | 20.78 | 24.61 |
| Telugu | 20.70 | 26.00 |
| Bengali | 21.91 | 25.04 |
| Gujarati | 18.17 | 21.30 |
| Malayalam | 22.26 | 23.91 |
| Marathi | 19.65 | 25.22 |
| Odia | 22.26 | 24.17 |
Accuracy: **(Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43 points)**
#### MMLU-Indic
| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|--------------------------|
| Hindi | 28.01 | 31.45 |
| Kannada | 26.74 | 30.12 |
| Tamil | 27.53 | 30.84 |
| Telugu | 27.20 | 31.02 |
| Bengali | 28.36 | 31.44 |
| Gujarati | 25.91 | 29.28 |
| Malayalam | 26.65 | 29.77 |
| Marathi | 27.12 | 30.63 |
| Odia | 27.05 | 30.45 |
| Punjabi | 26.42 | 29.61 |
| Assamese | 25.98 | 29.23 |
| Sinhala | 24.87 | 27.66 |
| Urdu | 25.44 | 28.71 |
Accuracy: **(Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48 points)**
## Acknowledgments
**IndicPhi-mini** is based on Microsoft's **Phi-mini-MoE-Instruct** model and was fine-tuned by the **SandLogic** development team.
Special thanks to:
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible.
The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).
---
## Contact
For any inquiries or support, please contact us at [email protected] or visit our [Website](https://www.sandlogic.com/).