---
license: mit
language:
- en
- bn
- hi
- mr
- ta
- te
- ml
- pa
- gu
- or
base_model:
- ai4bharat/indictrans2-en-indic-1B
pipeline_tag: translation
metrics:
- bleu
- google_bleu
- chrf++
inference: false
datasets:
- MILPaC
tags:
- InLegalTrans
- Legal
- NLP
---

# InLegalTrans

This is the model card of the ***InLegalTrans-En2Indic-1B*** translation model, a fine-tuned version of the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model specifically tailored for translating Indian legal texts from English to Indian languages.

### Training Data

We use the [**MILPaC**](https://github.com/Law-AI/MILPaC) **(Multilingual Indian Legal Parallel Corpus)** for fine-tuning. It is the first high-quality Indian legal parallel corpus, containing parallel aligned text units in English (EN) and nine Indian (IN) languages -- Bengali (BN), Hindi (HI), Marathi (MR), Tamil (TA), Telugu (TE), Malayalam (ML), Punjabi (PA), Gujarati (GU), and Oriya (OR). Please refer to the [paper](https://arxiv.org/abs/2310.09765) for more details about this corpus.

For fine-tuning, we randomly split MILPaC language-wise in an 80% (train) - 10% (validation) - 10% (test) ratio. We use the 80% train split (the 80% splits of all English-to-Indic language pairs combined) to fine-tune the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model, and the 10% validation split (combined in the same way) to select the best checkpoint and to prevent overfitting.

### Model Overview and Usage Instructions

This [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) model uses the same tokenizer as the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model and has the same architecture with ~1.12B parameters.
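The model expects FLORES-200 style language tags, as used by IndicTrans2. For convenience, a mapping from the nine MILPaC target languages to their codes is sketched below; the dictionary name is illustrative and is not part of the model or the toolkit.

```python
# FLORES-200 codes for the nine MILPaC target languages
# (the source side is always English, "eng_Latn").
# Illustrative helper; not shipped with the model or IndicTransToolkit.
MILPAC_TARGET_CODES = {
    "Bengali (BN)":   "ben_Beng",
    "Hindi (HI)":     "hin_Deva",
    "Marathi (MR)":   "mar_Deva",
    "Tamil (TA)":     "tam_Taml",
    "Telugu (TE)":    "tel_Telu",
    "Malayalam (ML)": "mal_Mlym",
    "Punjabi (PA)":   "pan_Guru",
    "Gujarati (GU)":  "guj_Gujr",
    "Oriya (OR)":     "ory_Orya",
}
```

The usage example below translates into Bengali (`ben_Beng`); substitute any code from this mapping to target a different language.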
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # Install IndicTransToolkit from https://github.com/VarunGumma/IndicTransToolkit

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

src_lang, tgt_lang = "eng_Latn", "ben_Beng"  # Use the BCP-47 language codes used by the FLORES-200 dataset

# Use the IndicTrans2 tokenizer to enable their custom tokenization script to be run
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "law-ai/InLegalTrans-En2Indic-1B",
    trust_remote_code=True,
    attn_implementation="eager",
    low_cpu_mem_usage=True,
).to(device)

ip = IndicProcessor(inference=True)

input_sentences = [
    "(7) Any such allowance for the maintenance and expenses for proceeding shall be payable from the date of the order, or, if so ordered, from the date of the application for maintenance or expenses of proceeding, as the case may be.",
    "(2) Where it appears to the Tribunal that, in consequence of any decision of a competent Civil Court, any order made under section 9 should be cancelled or varied, it shall cancel the order or, as the case may be, vary the same accordingly.",
]

# Add language tags and apply IndicTrans2-specific preprocessing
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

input_text_encoding = tokenizer(
    batch,
    max_length=256,
    truncation=True,
    padding="longest",
    return_tensors="pt",
    return_attention_mask=True,
).to(device)

generated_tokens = model.generate(
    **input_text_encoding,
    max_length=256,
    do_sample=True,
    num_beams=4,
    num_return_sequences=1,
    early_stopping=False,
    use_cache=True,
)

# Decode with the target-language vocabulary
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(
        generated_tokens.detach().cpu().tolist(),
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )

# Undo the preprocessing (detokenization, script normalization, etc.)
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

for input_sentence, translation in zip(input_sentences, translations):
    print(f"Sentence in {src_lang} language: {input_sentence}")
    print(f"Translated Sentence in {tgt_lang} language: {translation}")
```
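The *BLEU*, *GLEU*, and *chrF++* scores reported in the next section can be computed along the following lines. This is a minimal sketch using the Hugging Face `evaluate` library, not the paper's exact evaluation script; the predictions and references below are placeholders.

```python
import evaluate  # pip install evaluate sacrebleu

# Placeholder data: model outputs and (single) reference translations
predictions = ["hypothetical system translation"]
references = [["hypothetical reference translation"]]

bleu = evaluate.load("sacrebleu")    # corpus BLEU, 0-100 scale
gleu = evaluate.load("google_bleu")  # GLEU (Google BLEU), 0-1 scale
chrf = evaluate.load("chrf")         # chrF / chrF++, 0-100 scale

print("BLEU:  ", bleu.compute(predictions=predictions, references=references)["score"])
print("GLEU:  ", gleu.compute(predictions=predictions, references=references)["google_bleu"])
# word_order=2 gives chrF++ (character n-grams plus word unigrams and bigrams)
print("chrF++:", chrf.compute(predictions=predictions, references=references, word_order=2)["score"])
```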
### Fine-tuning Results

The following table compares the performance of the [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) model with the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model over the 10% test split of **MILPaC**. Performance is evaluated using the *BLEU*, *GLEU*, and *chrF++* metrics. For all English-to-Indic language pairs, [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) consistently outperforms [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) across all evaluation metrics.

| EN-to-IN | Model | BLEU | GLEU | chrF++ |
|----------|---------------------|------|------|--------|
| EN-to-BN | *IndicTrans2* | 25.4 | 28.8 | 53.7 |
|          | ***InLegalTrans*** | **45.8** | **47.6** | **70.9** |
| EN-to-HI | *IndicTrans2* | 41.0 | 42.5 | 59.9 |
|          | ***InLegalTrans*** | **56.9** | **57.1** | **73.8** |
| EN-to-MR | *IndicTrans2* | 25.2 | 28.7 | 55.4 |
|          | ***InLegalTrans*** | **44.4** | **46.0** | **68.9** |
| EN-to-TA | *IndicTrans2* | 32.8 | 35.3 | 62.3 |
|          | ***InLegalTrans*** | **40.0** | **42.5** | **69.9** |
| EN-to-TE | *IndicTrans2* | 10.7 | 14.2 | 37.9 |
|          | ***InLegalTrans*** | **31.3** | **31.6** | **58.5** |
| EN-to-ML | *IndicTrans2* | 21.9 | 25.8 | 52.9 |
|          | ***InLegalTrans*** | **37.4** | **40.3** | **69.7** |
| EN-to-PA | *IndicTrans2* | 27.8 | 31.6 | 51.5 |
|          | ***InLegalTrans*** | **44.3** | **45.6** | **65.5** |
| EN-to-GU | *IndicTrans2* | 27.5 | 31.1 | 55.7 |
|          | ***InLegalTrans*** | **42.8** | **45.2** | **68.8** |
| EN-to-OR | *IndicTrans2* | 6.6 | 12.6 | 37.1 |
|          | ***InLegalTrans*** | **14.2** | **19.9** | **47.5** |

### Citation

If you use this [InLegalTrans](https://huggingface.co/law-ai/InLegalTrans-En2Indic-1B) translation model or the [**MILPaC**](https://github.com/Law-AI/MILPaC) corpus, please cite the following paper:

```
@article{mahapatra2024milpacnovelbenchmarkevaluating,
  title     = {MILPaC: A Novel Benchmark for Evaluating Translation of Legal Text to Indian Languages},
  author    = {Sayan Mahapatra and Debtanu Datta and Shubham Soni and Adrijit Goswami and Saptarshi Ghosh},
  year      = {2024},
  journal   = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  publisher = {Association for Computing Machinery},
}
```

### About Us

We are a group of Natural Language Processing (NLP) researchers from the *Indian Institute of Technology (IIT) Kharagpur*. Our research interests are primarily ML, DL, and NLP applications for the legal domain, with a special focus on the challenges and opportunities of the Indian legal scenario. Our current and past projects include:

- Legal Statute Identification
- Semantic segmentation of legal documents
- Monolingual (e.g., English-to-English) and cross-lingual (e.g., English-to-Hindi) summarization of legal documents
- Translation in the Indian legal domain
- Court Judgment Prediction
- Legal Document Matching

Explore our publicly available code and datasets at: [Law and AI, IIT Kharagpur](https://github.com/Law-AI).