# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

## Overview

**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as **QLoRA-based quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4% accuracy** improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.

**Total curated dataset:** ~29 million high-quality samples
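
As a rough sketch of the conversational standardization mentioned above, one record in such a format might look like the following. The schema and field names (`messages`, `language`, `source`) are illustrative assumptions, not the dataset's confirmed layout:

```python
# Hypothetical shape of one standardized conversational sample.
# Field names are illustrative assumptions, not the confirmed schema.
record = {
    "messages": [
        {"role": "user", "content": "भारत की राजधानी क्या है?"},            # "What is the capital of India?" (Hindi)
        {"role": "assistant", "content": "भारत की राजधानी नई दिल्ली है।"},  # "The capital of India is New Delhi."
    ],
    "language": "hi",           # ISO 639-1 language tag
    "source": "<dataset-name>"  # one of the curated source datasets
}
```

A single shared schema like this lets heterogeneous corpora be mixed freely and passed through one chat template at training time.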
---

### Training Details
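
A minimal sketch of the QLoRA-style setup described in the Overview, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries. The rank, alpha, dropout, and target modules below are illustrative assumptions, not the exact configuration used to train IndicPhi-mini:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA-style setup: load the frozen base model in 4-bit NF4 precision,
# then train only small low-rank LoRA adapter matrices on top of it.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-mini-MoE-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # may be required depending on transformers version
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-mini-MoE-instruct")

# LoRA adapter hyperparameters here are assumptions for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of all weights
```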
### Evaluation

Accuracy: **(Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)**

**MMLU-Indic**

| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|----------|-------------------------|--------------------------|

Accuracy: **(Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)**
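
The "+3.48%" style deltas reported above are absolute differences between mean accuracies (i.e., percentage points: 30.95 − 27.47 = 3.48). A small sketch of how per-language accuracy is typically computed, assuming a hypothetical prediction record format:

```python
from collections import defaultdict

def accuracy_by_language(examples):
    """Per-language accuracy (%). `examples` is a hypothetical list of
    dicts with 'language', 'prediction', and 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["language"]] += 1
        correct[ex["language"]] += int(ex["prediction"] == ex["answer"])
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}

# Headline improvement reported above: absolute difference of means.
print(f"{30.95 - 27.47:+.2f}")  # +3.48
```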
## Acknowledgments

The **Phi-mini-MoE-Instruct** models are based on the original work by **Microsoft** and fine-tuned by the **Sandlogic** development team.

Special thanks to:
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible.

The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).

---