---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---

# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

##  Overview
**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It was trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as **QLoRA-based 4-bit quantization** and **LoRA adapters**, the model strengthens Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent improvements of **3–4 percentage points in accuracy** across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.

---

##  Key Contributions
-  Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.  
-  Fine-tuned **Phi-mini-MoE** (7.6B params, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.  
-  Achieved gains of **3–4 percentage points in accuracy** on major Indic benchmarks:
    - **ARC-Challenge-Indic** (reasoning tasks)  
    - **MMLU-Indic** (knowledge & domain understanding)  
-  Improved **generalization across multiple Indic languages** including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.  

---

##  Model Architecture
- **Base model:** Phi-mini-MoE-Instruct (Microsoft)  
- **Parameters:** 7.6B total (2.4B active per token)  
- **Layers:** 32 decoder-only transformer blocks  
- **Attention:** Grouped Query Attention (GQA)  
- **Experts per layer:** 16 (Top-2 active per token)  
- **Context length:** 4096 tokens 
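
The architecture details above can be cross-checked against the published model configuration. A minimal sketch, assuming the checkpoint exposes the Mixtral-style MoE attribute names used by Hugging Face `transformers` (actual attribute names may differ for this checkpoint):

```python
from transformers import AutoConfig

# Load the base model's configuration (no weights are downloaded).
config = AutoConfig.from_pretrained("microsoft/Phi-mini-MoE-instruct")

# Attribute names below are assumptions based on Mixtral-style MoE configs.
print(getattr(config, "num_hidden_layers", "n/a"))        # expected: 32 decoder blocks
print(getattr(config, "num_local_experts", "n/a"))        # expected: 16 experts per layer
print(getattr(config, "num_experts_per_tok", "n/a"))      # expected: 2 (Top-2 routing)
print(getattr(config, "num_key_value_heads", "n/a"))      # fewer KV heads than query heads → GQA
print(getattr(config, "max_position_embeddings", "n/a"))  # expected: 4096-token context
```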

---

## Usage
To load the fine-tuned model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # 4-bit loading requires the bitsandbytes package; drop this argument to load in full precision.
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Hindi prompt: "What are the problems with online education in rural areas?"
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
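
Since the base checkpoint is instruction-tuned, prompts can also be formatted with the tokenizer's chat template, assuming the fine-tuned checkpoint ships one. A minimal sketch, continuing from the snippet above:

```python
messages = [
    {"role": "user", "content": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"}
]

# Build model-ready input IDs with the generation prompt appended.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```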

##  Dataset Preparation
### Data Sources
- **Total collected:** 561M samples sourced from **53 Hugging Face datasets**.  
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.  
- **Categories:** General text, translation, instruction, conversational.  

### Processing Pipeline
1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples.  
2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering.  
3. **Format Conversion** – standardized into the **UltraChat JSON schema** (multi-turn conversations); a sketch of the target record format follows this list.  
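
As an illustration, each cleaned sample is stored as a multi-turn conversation. A minimal sketch of one record, assuming an ultrachat_200k-style layout with a `messages` list of role/content turns (exact field names in the released dataset may differ):

```python
import json

# Hypothetical record layout; placeholders stand in for real conversation text.
record = {
    "messages": [
        {"role": "user", "content": "<user turn in one of the 13 Indic languages>"},
        {"role": "assistant", "content": "<assistant response in the same language>"},
        # ...additional turns for multi-turn conversations
    ]
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```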

### Final Cleaned Dataset
- **Size:** 29M samples  

### Dataset Distribution (Final Cleaned)

| Language   | Samples   |
|------------|-----------|
| Hindi      | 4.63M     |
| Kannada    | 3.54M     |
| Telugu     | 3.72M     |
| Tamil      | 3.86M     |
| Marathi    | 3.79M     |
| Malayalam  | 2.81M     |
| Gujarati   | 2.94M     |
| Bengali    | 1.82M     |
| Odia       | 438K      |
| Punjabi    | 1.21M     |
| Assamese   | 185K      |
| Sinhala    | 64K       |
| Urdu       | 58K       |

**Total curated dataset:** ~29 million high-quality samples

---

## Training Details
- **Hardware:** 1 × NVIDIA A100-80GB  
- **Precision:** QLoRA (4-bit quantization)  
- **Batching:** Effective batch size 256 (per-device batch size 32 × 8 gradient-accumulation steps)  
- **Steps:** 8,500  
- **Optimizer:** AdamW (8-bit) + cosine LR schedule + 1k warmup steps  
- **LoRA configuration:**  
  - Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj  
  - r=128, α=128, dropout=0  
- **Final training loss:** 0.48  
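
A minimal sketch of a QLoRA setup consistent with the configuration above, assuming the `peft`, `bitsandbytes`, and `transformers` libraries; the NF4 quantization type and bfloat16 compute dtype are assumptions, and the actual training script is not published:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit base weights (QLoRA); quant type and compute dtype are assumptions.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-mini-MoE-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention and MLP projections listed above.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```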

---

##  Evaluation & Results

### Benchmarks
1. **ARC-Challenge-Indic** (reasoning)  
2. **MMLU-Indic** (knowledge & domain understanding)  

### Improvements
- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 points)**  
  - Normalized Accuracy: **24.69 → 28.86 (+4.17 points)**  
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 points)**  
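
Both metrics are standard for multiple-choice benchmarks: each answer option is scored by the log-probability the model assigns to it given the question, and the normalized variant divides that score by the option length so longer answers are not penalized. A hedged sketch of this scoring convention (the exact evaluation harness used here is not specified):

```python
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, question: str, option: str):
    """Return (total_logprob, length_normalized_logprob) of `option` given `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    # Approximate option length in tokens; a real harness aligns tokenization more carefully.
    option_len = full_ids.shape[1] - prompt_ids.shape[1]

    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]

    # Log-prob of each token, predicted from the preceding position.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    option_logprobs = token_logprobs[:, -option_len:]

    total = option_logprobs.sum().item()
    return total, total / option_len

# Accuracy picks the option with the highest total log-prob;
# normalized accuracy picks the highest length-normalized score.
```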

###  Results

#### ARC-Challenge-Indic

| Language   | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|--------------------------|
| Hindi      | 22.61                   | 26.17                    |
| Kannada    | 20.96                   | 25.83                    | 
| Tamil      | 20.78                   | 24.61                    |
| Telugu     | 20.70                   | 26.00                    | 
| Bengali    | 21.91                   | 25.04                    | 
| Gujarati   | 18.17                   | 21.30                    | 
| Malayalam  | 22.26                   | 23.91                    | 
| Marathi    | 19.65                   | 25.22                    |
| Odia       | 22.26                   | 24.17                    |

Overall accuracy: **21.03 (Phi-mini-MoE) → 24.46 (IndicPhi-mini), +3.43 points**  

#### MMLU-Indic

| Language   | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|--------------------------|
| Hindi      | 28.01                   | 31.45                   |
| Kannada    | 26.74                   | 30.12                   |
| Tamil      | 27.53                   | 30.84                   |
| Telugu     | 27.20                   | 31.02                   |
| Bengali    | 28.36                   | 31.44                   |
| Gujarati   | 25.91                   | 29.28                   |
| Malayalam  | 26.65                   | 29.77                   |
| Marathi    | 27.12                   | 30.63                   |
| Odia       | 27.05                   | 30.45                   |
| Punjabi    | 26.42                   | 29.61                   |
| Assamese   | 25.98                   | 29.23                   |
| Sinhala    | 24.87                   | 27.66                   |
| Urdu       | 25.44                   | 28.71                   |

Overall accuracy: **27.47 (Phi-mini-MoE) → 30.95 (IndicPhi-mini), +3.48 points**

## Acknowledgments

**IndicPhi-mini** is based on Microsoft's **Phi-mini-MoE-Instruct** model and was fine-tuned by the **SandLogic** development team.

Special thanks to:
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible.  
  The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).

---

## Contact
For any inquiries or support, please contact us at [email protected] or visit our [Website](https://www.sandlogic.com/).