---
language:
- it
tags:
- text2text-generation
- summarization
license: mit
datasets:
- joelniklaus/Multi_Legal_Pile
library_name: transformers
pipeline_tag: text2text-generation
widget:
- text: "<mask> 1234: Il contratto si intende concluso quando..."
base_model:
- morenolq/bart-it
---

# 🏛️ Model Card: LEGIT-BART Series

## 🏛️ Model Overview
The **LEGIT-BART** models are a family of **pre-trained transformer-based models** for **Italian legal text processing**.
They build upon **BART-IT** ([`morenolq/bart-it`](https://huggingface.co/morenolq/bart-it)) and are further pre-trained on **Italian legal corpora**.

💡 Key features:
- **Extended context length** with **Local-Sparse-Global (LSG) attention** (up to **16,384 tokens**; see the sketch after this list) 📜
- **Trained on legal documents** such as **statutes, case law, and contracts** 📑
- **Not fine-tuned for specific tasks** (requires further adaptation)
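
To give a concrete sense of the extended context window, here is a minimal sketch that tokenizes a long document with one of the LSG checkpoints (the placeholder text is illustrative only):

```python
from transformers import AutoTokenizer

# Placeholder long Italian legal text (illustrative only).
long_doc = "Art. 1. Il contratto si intende concluso quando... " * 2000

tokenizer = AutoTokenizer.from_pretrained("morenolq/LEGIT-BART-LSG-16384")
enc = tokenizer(long_doc, return_tensors="pt", max_length=16384, truncation=True)

# Shape is up to (1, 16384); standard BART models cap out at 1,024 tokens.
print(enc.input_ids.shape)
```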
					
						
⚠️ This specific model is pre-trained on general-purpose Italian text and has **no legal-domain adaptation**. Please select the most suitable model from the table below.

## 📂 Available Models

| Model | Description | Link |
|--------|-------------|------|
| **LEGIT-BART** | Continued pre-training of `morenolq/bart-it` on **Italian legal texts** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-BART) |
| **LEGIT-BART-LSG-4096** | Continued pre-training of `morenolq/bart-it`, supporting **4,096 tokens** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-BART-LSG-4096) |
| **LEGIT-BART-LSG-16384** | Continued pre-training of `morenolq/bart-it`, supporting **16,384 tokens** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-BART-LSG-16384) |
| **LEGIT-SCRATCH-BART** | Trained from scratch on **Italian legal texts** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-SCRATCH-BART) |
| **LEGIT-SCRATCH-BART-LSG-4096** | Trained from scratch with **LSG attention**, supporting **4,096 tokens** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-SCRATCH-BART-LSG-4096) |
| **LEGIT-SCRATCH-BART-LSG-16384** | Trained from scratch with **LSG attention**, supporting **16,384 tokens** | [🔗 Link](https://huggingface.co/morenolq/LEGIT-SCRATCH-BART-LSG-16384) |
| **BART-IT-LSG-4096** | `morenolq/bart-it` with **LSG attention**, supporting **4,096 tokens** (⚠️ no legal adaptation) | [🔗 Link](https://huggingface.co/morenolq/BART-IT-LSG-4096) |
| **BART-IT-LSG-16384** | `morenolq/bart-it` with **LSG attention**, supporting **16,384 tokens** (⚠️ no legal adaptation) | [🔗 Link](https://huggingface.co/morenolq/BART-IT-LSG-16384) |

---

## 🛠️ Model Details

🔹 **Architecture**
- Base model: [`morenolq/bart-it`](https://huggingface.co/morenolq/bart-it)
- Transformer encoder-decoder
- **LSG attention** for long documents
- Dedicated tokenizers for the models trained from scratch (these underperformed continued pre-training in our experiments)

🔹 **Training Data**
- Dataset: [`joelniklaus/Multi_Legal_Pile`](https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile)
- Types of legal texts used (a loading sketch follows this list):
  - **Legislation** (laws, codes, amendments)
  - **Case law** (judicial decisions)
  - **Contracts** (public legal agreements)
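
For orientation, here is a sketch of how the Italian portion of the corpus can be streamed; the `it_legislation` config name and the `text` field are assumptions about the dataset layout, so check the dataset card for the exact names:

```python
from datasets import load_dataset

# Stream one subset instead of downloading the full corpus.
# NOTE: the "it_legislation" config name and "text" field are assumptions.
ds = load_dataset(
    "joelniklaus/Multi_Legal_Pile",
    "it_legislation",
    split="train",
    streaming=True,
)
print(next(iter(ds))["text"][:200])
```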
					
						

---

## 🚀 How to Use

```python
from transformers import BartForConditionalGeneration, AutoTokenizer

# Load tokenizer and model
model_name = "morenolq/BART-IT-LSG-4096"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Example input: mask infilling on an Italian legal snippet
input_text = "<mask> 1234: Il contratto si intende concluso quando..."
inputs = tokenizer(input_text, return_tensors="pt", max_length=4096, truncation=True)

# Generate the output sequence
output_ids = model.generate(inputs.input_ids, max_length=150, num_beams=4, early_stopping=True)
output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("📝 Output:", output)
```
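
Because this checkpoint is only pre-trained, the generated text is a denoised, mask-infilled continuation rather than a task-specific output such as a true summary (see the limitations below). Note also that, depending on how the LSG checkpoints are packaged, loading them may require passing `trust_remote_code=True` to `from_pretrained`; verify this against the model repository.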
					
						
---

## ⚠️ Limitations & Ethical Considerations
- **Not fine-tuned for specific tasks**: The models are pre-trained on legal texts and may require further adaptation for specific legal NLP tasks (e.g., summarization, question answering); a fine-tuning sketch follows this list.
- **Bias and fairness**: Legal texts may reflect biases present in the legal system. Care should be taken to ensure fairness and ethical use of the models.
- **Legal advice**: The models are not a substitute for professional legal advice. Always consult a qualified legal professional for legal matters.
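
As a starting point for such adaptation, here is a minimal fine-tuning sketch using the standard `Seq2SeqTrainer` API; the `legal_sum.jsonl` file and its `document`/`summary` columns are purely hypothetical:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "morenolq/LEGIT-BART"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Hypothetical JSONL dataset with "document" and "summary" columns.
ds = load_dataset("json", data_files="legal_sum.jsonl", split="train")

def preprocess(batch):
    # Tokenize source documents and target summaries.
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds.map(preprocess, batched=True, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="legit-bart-sum", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```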
					
						
---

## 📚 Reference

The paper presenting the LEGIT-BART models is published in *Artificial Intelligence and Law*:

```bibtex
@article{benedetto2025legitbart,
  title     = {LegItBART: a summarization model for Italian legal documents},
  author    = {Benedetto, Irene and La Quatra, Moreno and Cagliero, Luca},
  year      = {2025},
  journal   = {Artificial Intelligence and Law},
  publisher = {Springer},
  pages     = {1--31},
  doi       = {10.1007/s10506-025-09436-y},
  url       = {https://doi.org/10.1007/s10506-025-09436-y}
}
```

---