Legal-BERT-PEFT-EURLEX

A BERT model fine-tuned on EU legal documents from the Pile of Law dataset using Parameter-Efficient Fine-Tuning (PEFT).

Model Details

Base Model

  • Architecture: bert-base-uncased
  • Parameters: 110 million
  • Language: English

Fine-tuning Details

  • Method: PEFT with LoRA (Low-Rank Adaptation)
  • Trainable Parameters: 1.3 million (1.21% of total)
  • Training Approach: Masked Language Modeling (MLM)
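
A minimal sketch of how such an adapter can be set up with the peft library is shown below; the LoRA rank, alpha, and target modules are illustrative assumptions and are not guaranteed to reproduce the exact 1.3 million trainable parameters reported above.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForMaskedLM

# Wrap the base model with an illustrative LoRA adapter
base_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update (assumption)
    lora_alpha=32,                       # scaling factor (assumption)
    lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections to adapt (assumption)
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()       # prints trainable vs. total parameter counts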

Training Data

  • Dataset: EURLEX subset of Pile of Law
  • Training Samples: 20,000 legal documents
  • Domain: European Union Legal Documents
  • Text Length: Average 13,327 characters per document
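
A minimal sketch of how this training subset could be assembled with the datasets library, assuming the "eurlex" configuration name and a "text" field as documented on the Pile of Law dataset card; the streaming approach and the 20,000-document cap are illustrative.

from datasets import load_dataset

# Stream the EURLEX subset of Pile of Law and keep the first 20,000 documents
# ("eurlex" config name and "text" field are assumptions from the dataset card)
dataset = load_dataset(
    "pile-of-law/pile-of-law",
    "eurlex",
    split="train",
    streaming=True,
    trust_remote_code=True,
)
train_docs = [example["text"] for example, _ in zip(dataset, range(20_000))]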

Performance

Training Results

| Metric     | Base Model | Fine-tuned Model | Improvement |
|------------|------------|------------------|-------------|
| Test Loss  | 1.9327     | 0.6580           | 66.69%      |
| Perplexity | 6.91       | 1.93             | 72.07%      |
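
Perplexity here is simply the exponential of the cross-entropy loss, so the two rows are consistent with each other:

import math

# perplexity = exp(cross-entropy loss)
print(math.exp(1.9327))  # ≈ 6.91 (base model)
print(math.exp(0.6580))  # ≈ 1.93 (fine-tuned model)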

Training Configuration

Hyperparameters

| Parameter           | Value                                       |
|---------------------|---------------------------------------------|
| Learning Rate       | 2e-4                                        |
| Batch Size          | 16 (per-device 8 × gradient accumulation 2) |
| Epochs              | 3                                           |
| Max Sequence Length | 512                                         |
| Warmup Steps        | 500                                         |
| Weight Decay        | 0.01                                        |
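
A minimal sketch of how these values map onto transformers.TrainingArguments; the output directory is a placeholder, and the maximum sequence length of 512 is applied at tokenization time rather than here.

from transformers import TrainingArguments

# Illustrative TrainingArguments mirroring the hyperparameter table
training_args = TrainingArguments(
    output_dir="legal-bert-peft-eurlex",   # placeholder path
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,         # effective batch size of 16
    num_train_epochs=3,
    warmup_steps=500,
    weight_decay=0.01,
)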

Intended Use Cases

Recommended Use

  • Legal document analysis and processing
  • Masked language modeling in legal contexts
  • Legal text understanding and mask-filling completion
  • Research in computational law and legal AI
  • Educational purposes in legal technology

Limitations and Bias

Limitations

  • Domain Specific: Primarily effective on legal text, especially EU law
  • Language: English only
  • Scope: Trained on a subset of EURLEX documents
  • Temporal Scope: Training data up to 2022 only

Qualitative Examples

Example 1: Legal Judgment

Input: "The court found the defendant [MASK] of all charges."
Predictions: ["guilty", "innocent", "acquitted", "free", "liable"]

Example 2: Contract Law

Input: "The contract was declared [MASK] due to fraudulent activities."
Predictions: ["void", "invalid", "null", "bankrupt", "cancelled"]

Example 3: Civil Law

Input: "The plaintiff sought [MASK] for damages incurred."
Predictions: ["compensation", "damages", "only", "insurance", "forgiveness"]

Usage

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer (if this repository hosts a PEFT adapter,
# the peft package must be installed for direct loading to work)
model = AutoModelForMaskedLM.from_pretrained("Nahla-yasmine/legal-bert-peft-eurlex")
tokenizer = AutoTokenizer.from_pretrained("Nahla-yasmine/legal-bert-peft-eurlex")

# Example: Masked language prediction
text = "The court found the defendant [MASK] of all charges."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get top predictions
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
logits = outputs.logits[0, mask_token_index, :]
top_tokens = torch.topk(logits, 5, dim=1).indices[0].tolist()

for i, token_id in enumerate(top_tokens):
    predicted_token = tokenizer.decode([token_id])
    print(f"{i+1}. {predicted_token}")

Advanced Usage with PEFT

from peft import PeftModel, PeftConfig
from transformers import AutoModelForMaskedLM

# Load base model
base_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Load PEFT adapter
model = PeftModel.from_pretrained(base_model, "Nahla-yasmine/legal-bert-peft-eurlex")
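
If a standalone model is preferred for inference, the adapter weights can be folded into the base model with PEFT's merge_and_unload(); the tokenizer itself is unchanged and can be loaded from bert-base-uncased.

# Merge the LoRA weights into the base model and drop the PEFT wrappers
merged_model = model.merge_and_unload()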

Citation

@software{legal_bert_peft_2024,
  title  = {Legal-BERT-PEFT-EURLEX},
  author = {Nahla-yasmine},
  year   = {2024},
  url    = {https://huggingface.co/Nahla-yasmine/legal-bert-peft-eurlex}
}

Framework Versions

  • PEFT 0.17.0