|
|
--- |
|
|
library_name: peft |
|
|
base_model: facebook/bart-large |
|
|
--- |
|
|
|
|
|
# Model Card for Nevidu/LexBartLo_2
|
|
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
|
|
|
|
|
- **Paper:** "A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization", available at https://arxiv.org/abs/2503.10354 and https://ieeexplore.ieee.org/document/11030964
|
|
- **Developed by:** Nevidu Jayatilleke and Ruvan Weerasinghe |
|
|
<!-- - **Funded by [optional]:** [More Information Needed] |
|
|
- **Shared by [optional]:** [More Information Needed] --> |
|
|
<!-- - **Model type:** [More Information Needed] --> |
|
|
- **Supported Language:** English |
|
|
- **Finetuned Domains:** Textile, Mechanical Engineering, Fixed Construction, and Human Necessities patent documents from the BigPatent dataset
|
|
<!-- - **License:** [More Information Needed] --> |
|
|
- **Finetuned from model:** facebook/bart-large |
|
|
- **Link to the Specialised Model:** https://huggingface.co/Nevidu/LexBartLo_1 |
|
|
|
|
|
<!-- ### Model Sources --> |
|
|
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
|
|
<!-- - **Repository:** [More Information Needed] --> |
|
|
|
|
|
|
|
|
## How to use the model |
|
|
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
|
|
```python |
|
|
from peft import PeftModel, PeftConfig |
|
|
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer |
|
|
import nltk |
|
|
from nltk.tokenize import sent_tokenize, word_tokenize |
|
|
from nltk.corpus import stopwords |
|
|
from nltk.cluster.util import cosine_distance |
|
|
import numpy as np |
|
|
import networkx as nx |
|
|
import torch
|
|
|
|
|
# Split the document into sentences and lowercase-tokenize each one.
def preprocess_text(text):
|
|
sentences = sent_tokenize(text) |
|
|
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences] |
|
|
return tokenized_sentences |
|
|
|
|
|
# Cosine similarity between two tokenized sentences over their shared vocabulary,
# with English stopwords removed.
def sentence_similarity(sentence1, sentence2):
|
|
stop_words = set(stopwords.words('english')) |
|
|
filtered_sentence1 = [w for w in sentence1 if w not in stop_words] |
|
|
filtered_sentence2 = [w for w in sentence2 if w not in stop_words] |
|
|
all_words = list(set(filtered_sentence1 + filtered_sentence2)) |
|
|
    vector1 = [filtered_sentence1.count(word) for word in all_words]
    vector2 = [filtered_sentence2.count(word) for word in all_words]
    # Guard against all-stopword sentences, which would otherwise divide by zero.
    if not any(vector1) or not any(vector2):
        return 0.0
    return 1 - cosine_distance(vector1, vector2)
|
|
|
|
|
# Pairwise similarity matrix over all sentences (zero on the diagonal).
def build_similarity_matrix(sentences):
|
|
similarity_matrix = np.zeros((len(sentences), len(sentences))) |
|
|
for i in range(len(sentences)): |
|
|
for j in range(len(sentences)): |
|
|
if i != j: |
|
|
similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j]) |
|
|
return similarity_matrix |
|
|
|
|
|
# Score sentences with PageRank over the similarity graph (LexRank).
# Note: `threshold` is passed as PageRank's convergence tolerance here.
def apply_lexrank(similarity_matrix, damping=0.85, threshold=0.2, max_iter=100):
|
|
nx_graph = nx.from_numpy_array(similarity_matrix) |
|
|
scores = nx.pagerank(nx_graph, alpha=damping, tol=threshold, max_iter=max_iter) |
|
|
return scores |
|
|
|
|
|
# Order sentences from highest to lowest LexRank score.
def get_top_sentences(sentences, scores):
|
|
ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)), reverse=True) |
|
|
top_sentences = [sentence for score, sentence in ranked_sentences] |
|
|
return top_sentences |
|
|
|
|
|
# Rebuild the document as a single paragraph with sentences ordered by salience.
def extract_important_sentences(text):
|
|
preprocessed_sentences = preprocess_text(text) |
|
|
similarity_matrix = build_similarity_matrix(preprocessed_sentences) |
|
|
scores = apply_lexrank(similarity_matrix) |
|
|
top_sentences = get_top_sentences(preprocessed_sentences, scores) |
|
|
paragraph = ' '.join([' '.join(sentence) for sentence in top_sentences]) |
|
|
return paragraph |
|
|
|
|
|
def summarize(text, max_tokens):
    # For repeated calls, consider loading the model and tokenizer once outside this function.
|
|
|
|
|
peft_model = "Nevidu/LexBartLo_2" |
|
|
config = PeftConfig.from_pretrained(peft_model) |
|
|
|
|
|
    # Load the base model and tokenizer
|
|
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) |
|
|
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path) |
|
|
|
|
|
    # Apply the LoRA adapter on top of the base model
|
|
model = PeftModel.from_pretrained(model, peft_model) |
|
|
|
|
|
sorted_text = extract_important_sentences(text) |
|
|
|
|
|
input_ids = tokenizer(sorted_text, return_tensors="pt", truncation=True).input_ids |
|
|
    with torch.inference_mode():
        outputs = model.generate(input_ids=input_ids, max_new_tokens=max_tokens, do_sample=True, top_p=0.9)
|
|
    summary = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
|
|
return summary |
|
|
|
|
|
text = """ Add your patent text""" |
|
|
max_tokens = 256 |
|
|
|
|
|
summary = summarize(text, max_tokens)
print(summary)
|
|
``` |
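Generation above uses nucleus sampling (`do_sample=True, top_p=0.9`), so summaries vary between runs. If you prefer deterministic output, beam search is a common alternative (a sketch with assumed settings, not from the original card):

```python
# Deterministic decoding with beam search (assumed settings, not from the paper).
outputs = model.generate(input_ids=input_ids, max_new_tokens=max_tokens,
                         num_beams=4, early_stopping=True)
summary = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```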
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
|
|
@inproceedings{jayatilleke2025hybrid, |
|
|
title={A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization}, |
|
|
author={Jayatilleke, Nevidu and Weerasinghe, Ruvan}, |
|
|
booktitle={2025 International Research Conference on Smart Computing and Systems Engineering (SCSE)}, |
|
|
pages={1--6}, |
|
|
year={2025}, |
|
|
organization={IEEE} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- PEFT 0.9.0 |