---
license: mit
language:
- en
- bn
- hi
- mr
- ta
- te
- ml
- pa
- gu
- or
base_model:
- ai4bharat/indictrans2-en-indic-1B
pipeline_tag: translation
metrics:
- bleu
- google_bleu
- chrf++
inference: false
datasets:
- MILPaC
tags:
- InLegalTrans
- Legal
- NLP
---

# InLegalTrans

This is the model card of the ***InLegalTrans-En2Indic-1B*** translation model, a fine-tuned version of the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model specifically tailored for translating Indian legal texts from English to Indian languages.

### Training Data

We use the [**MILPaC**](https://github.com/Law-AI/MILPaC) **(Multilingual Indian Legal Parallel Corpus)** for fine-tuning. It is the first high-quality Indian legal parallel corpus, containing parallel aligned text units in English (EN) and nine Indian (IN) languages -- Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Oriya (OR). Please refer to the [paper](https://arxiv.org/abs/2310.09765) for more details about this corpus.

For fine-tuning, we randomly split MILPaC language-wise in an 80% (train) - 10% (validation) - 10% (test) ratio. We use the 80% train split (the 80% splits of all English-to-Indic language pairs combined) to fine-tune the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model, and the 10% validation split (combined in the same way) to select the best checkpoint and to prevent overfitting.

### Model Overview and Usage Instructions

This [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) model uses the same tokenizer as the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model and has the same architecture with ~1.12B parameters.
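The model expects FLORES-200 style language tags, as used by IndicTrans2. For convenience, a mapping from the nine MILPaC target languages to their codes is sketched below; the dictionary name is illustrative and is not part of the model or the toolkit.

```python
# FLORES-200 codes for the nine MILPaC target languages
# (the source side is always English, "eng_Latn").
# Illustrative helper; not shipped with the model or IndicTransToolkit.
MILPAC_TARGET_CODES = {
    "Bengali (BN)":   "ben_Beng",
    "Hindi (HI)":     "hin_Deva",
    "Marathi (MR)":   "mar_Deva",
    "Tamil (TA)":     "tam_Taml",
    "Telugu (TE)":    "tel_Telu",
    "Malayalam (ML)": "mal_Mlym",
    "Punjabi (PA)":   "pan_Guru",
    "Gujarati (GU)":  "guj_Gujr",
    "Oriya (OR)":     "ory_Orya",
}
```

The usage example below translates into Bengali (`ben_Beng`); substitute any code from this mapping to target a different language.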
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

src_lang, tgt_lang = "eng_Latn", "ben_Beng"  # Use the BCP-47 language codes used by the FLORES-200 dataset

# Use the IndicTrans2 tokenizer to enable their custom tokenization script to be run
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)

ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

# Add language tags and apply IndicTrans2-specific preprocessing
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)

# Decode with the target-language vocabulary
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

# Undo the preprocessing (detokenization, script normalization, etc.)
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
```
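The *BLEU*, *GLEU*, and *chrF++* scores reported in the next section can be computed along the following lines. This is a minimal sketch using the Hugging Face `evaluate` library, not the paper's exact evaluation script; the predictions and references below are placeholders.

```python
import evaluate  # pip install evaluate sacrebleu

# Placeholder data: model outputs and (single) reference translations
predictions = ["hypothetical system translation"]
references = [["hypothetical reference translation"]]

bleu = evaluate.load("sacrebleu")    # corpus BLEU, 0-100 scale
gleu = evaluate.load("google_bleu")  # GLEU (Google BLEU), 0-1 scale
chrf = evaluate.load("chrf")         # chrF / chrF++, 0-100 scale

print("BLEU:  ", bleu.compute(predictions=predictions, references=references)["score"])
print("GLEU:  ", gleu.compute(predictions=predictions, references=references)["google_bleu"])
# word_order=2 gives chrF++ (character n-grams plus word unigrams and bigrams)
print("chrF++:", chrf.compute(predictions=predictions, references=references, word_order=2)["score"])
```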
### Fine-tuning Results

The following table compares the performance of the [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) model with the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model over the 10% test split of **MILPaC**. Performance is evaluated using the *BLEU*, *GLEU*, and *chrF++* metrics. For all English-to-Indic language pairs, [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) consistently outperforms [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) across all evaluation metrics.

| EN-to-IN | Model | BLEU | GLEU | chrF++ |
|----------|---------------------|------|------|--------|
| EN-to-BN | *IndicTrans2* | 25.4 | 28.8 | 53.7 |
|          | ***InLegalTrans*** | **45.8** | **47.6** | **70.9** |
| EN-to-HI | *IndicTrans2* | 41.0 | 42.5 | 59.9 |
|          | ***InLegalTrans*** | **56.9** | **57.1** | **73.8** |
| EN-to-MR | *IndicTrans2* | 25.2 | 28.7 | 55.4 |
|          | ***InLegalTrans*** | **44.4** | **46.0** | **68.9** |
| EN-to-TA | *IndicTrans2* | 32.8 | 35.3 | 62.3 |
|          | ***InLegalTrans*** | **40.0** | **42.5** | **69.9** |
| EN-to-TE | *IndicTrans2* | 10.7 | 14.2 | 37.9 |
|          | ***InLegalTrans*** | **31.3** | **31.6** | **58.5** |
| EN-to-ML | *IndicTrans2* | 21.9 | 25.8 | 52.9 |
|          | ***InLegalTrans*** | **37.4** | **40.3** | **69.7** |
| EN-to-PA | *IndicTrans2* | 27.8 | 31.6 | 51.5 |
|          | ***InLegalTrans*** | **44.3** | **45.6** | **65.5** |
| EN-to-GU | *IndicTrans2* | 27.5 | 31.1 | 55.7 |
|          | ***InLegalTrans*** | **42.8** | **45.2** | **68.8** |
| EN-to-OR | *IndicTrans2* | 6.6 | 12.6 | 37.1 |
|          | ***InLegalTrans*** | **14.2** | **19.9** | **47.5** |

### Citation

If you use this [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) translation model or the [**MILPaC**](https://github.com/Law-AI/MILPaC) corpus, please cite the following paper:

```
@article{mahapatra2024milpacnovelbenchmarkevaluating,
  title     = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages},
  author    = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
  year      = {2024},
  journal   = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  publisher = {Association for Computing Machinery},
}
```

### About Us

We are a group of Natural Language Processing (NLP) researchers from the *Indian Institute of Technology (IIT) Kharagpur*. Our research interests are primarily ML, DL, and NLP applications for the legal domain, with a special focus on the challenges and opportunities of the Indian legal scenario. Our current and past projects include:

- Legal Statute Identification
- Semantic segmentation of legal documents
- Monolingual (e.g., English-to-English) and cross-lingual (e.g., English-to-Hindi) summarization of legal documents
- Translation in the Indian legal domain
- Court Judgment Prediction
- Legal Document Matching

Explore our publicly available code and datasets at: [Law and AI, IIT Kharagpur](https://github.com/Law-AI).