---
language:
- en
thumbnail: "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Seal_of_the_United_States_Patent_and_Trademark_Office.svg/1200px-Seal_of_the_United_States_Patent_and_Trademark_Office.svg.png"
tags:
- semantic-similarity
- patents
- legal-tech
- sentence-transformers
- ridge-regression
- scikit-learn
license: apache-2.0
datasets:
- uspto-patent-phrase-to-phrase
metrics:
- pearson
- cosine-similarity
base_model: sentence-transformers/all-mpnet-base-v2
---

# PatentSim: Semantic Similarity Model for U.S. Patent Phrase Matching

**Model Name:** `PatentSim Word Semantic Similarity`
**Author:** Michael Posso
**Source:** Trained for the Kaggle "U.S. Patent Phrase to Phrase Matching" competition
**License:** Apache 2.0

---

## Model Description

`PatentSim` is a lightweight, hybrid machine learning pipeline that evaluates the semantic similarity between pairs of phrases in patent literature. It combines a **pre-trained transformer-based sentence encoder** with a **ridge regression model** trained to map cosine similarity scores to human-labeled relatedness. The model was developed for the Kaggle competition hosted by the U.S. Patent and Trademark Office (USPTO).

It addresses the need for semantic-equivalence detection in legal and technical language, supporting tasks such as patent search, prior-art retrieval, and claim comparison.

---

## Intended Use

- **Patent Prior Art Search**
- **Legal Text Similarity**
- **Technical Paraphrase Detection**
- **Domain-aware Semantic Matching**

---

## Model Architecture

- **Sentence Embeddings:** [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Similarity Function:** Cosine similarity between sentence embeddings
- **Regressor:** Scikit-learn Ridge regression (`alpha=1.0`), trained to map cosine similarity to semantic relatedness scores

---

## Dataset

- **Source:** Kaggle – [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching)
- **Size:** 45,000+ anchor–target phrase pairs with similarity scores from 0.0 to 1.0
- **Domains:** Technical, scientific, and commercial IP content
- **Features:** CPC classification codes that provide domain context

---

## Training Procedure

1. Sentence embeddings for anchor and target phrases were generated with `sentence-transformers/all-mpnet-base-v2`.
2. Each phrase was also encoded in lowercase, and the two embeddings were averaged to reduce sensitivity to casing.
3. The cosine similarity between anchor and target embeddings formed the feature set.
4. Ridge regression was trained to predict the human-labeled semantic similarity scores.

A minimal sketch of this training pipeline appears after the Usage example below.

---

## Usage

```python
from sentence_transformers import SentenceTransformer, util
from joblib import load
import numpy as np

# Load the sentence embedding model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Load the trained ridge regression model
reg = load("ridge_model.joblib")

def predict_similarity(anchor, target):
    """Predict a semantic similarity score for an anchor/target phrase pair."""
    a_emb = model.encode(anchor, convert_to_tensor=True)
    t_emb = model.encode(target, convert_to_tensor=True)
    cosine_sim = util.cos_sim(a_emb, t_emb).item()
    score = reg.predict(np.array([[cosine_sim]]))[0]
    return score
```
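For example (the phrases below are arbitrary illustrations; the returned value depends on the trained `ridge_model.joblib` being present):

```python
# Arbitrary example phrases, for illustration only
score = predict_similarity("wireless communication device", "radio transmitter")
print(f"Predicted similarity: {score:.3f}")
```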
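---

## Training Sketch

The following is a minimal, illustrative sketch of the training procedure described above, not the exact competition code. It assumes the Kaggle `train.csv` file with `anchor`, `target`, and `score` columns; handling of the CPC context codes is omitted.

```python
import numpy as np
import pandas as pd
import torch.nn.functional as F
from joblib import dump
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

# Kaggle competition training file (anchor, target, score columns assumed)
df = pd.read_csv("train.csv")

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def embed(phrases):
    # Step 2: average the embeddings of the original and lowercased phrase
    original = model.encode(list(phrases), convert_to_tensor=True)
    lowered = model.encode([p.lower() for p in phrases], convert_to_tensor=True)
    return (original + lowered) / 2

a_emb = embed(df["anchor"])
t_emb = embed(df["target"])

# Step 3: row-wise cosine similarity between anchor and target embeddings
cosine_sim = F.cosine_similarity(a_emb, t_emb, dim=1).cpu().numpy()

# Step 4: fit ridge regression against the human-labeled scores
reg = Ridge(alpha=1.0)
reg.fit(cosine_sim.reshape(-1, 1), df["score"].values)

dump(reg, "ridge_model.joblib")
```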