Why does cosine similarity of identical embeddings not return exactly 1.0, and is there a way to ensure this?
#3 · by ChristianBecker · opened
Hi,
I’m using ibm-granite/granite-embedding-125m-english to compute sentence embeddings. When I compute the cosine similarity between an embedding vector and itself, I expect to get 1.0.
However, I consistently get a value like 0.9991035461425781.
I understand this might be due to floating-point precision limits, but:
👉 Is there a recommended way (e.g., a different normalization step or method) to ensure that the cosine similarity between an embedding and itself is exactly 1.0?
👉 Is this slight deviation expected with this model’s output, or am I missing a step (e.g., an official normalization layer)?
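As a side note on the precision question, here is a minimal NumPy sketch (using a random vector as a stand-in for a model embedding, since the behavior is independent of the model): the cosine similarity of a vector with itself differs from 1.0 only at float32 rounding scale, and pre-normalizing the vector to unit length (which `sentence-transformers` can do via `model.encode(..., normalize_embeddings=True)`) reduces cosine similarity to a plain dot product.

```python
import numpy as np

# Random float32 vector standing in for a 768-dim model embedding.
rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)

# Cosine similarity of v with itself: dot(v, v) / (||v|| * ||v||).
# In float32 this is close to 1.0 but may not be bit-exact.
cos = float(np.dot(v, v) / (np.linalg.norm(v) * np.linalg.norm(v)))
print(cos)

# Pre-normalizing to unit length: the dot product IS the cosine similarity,
# and the deviation from 1.0 stays at rounding-error scale.
v_unit = v / np.linalg.norm(v)
cos_norm = float(np.dot(v_unit, v_unit))
print(cos_norm)
```

A deviation as large as 0.9991 is well beyond float32 rounding, so if you see that value, the two vectors being compared are probably not byte-identical (e.g. encoded in separate calls with different settings) rather than a precision artifact.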
Hi @ChristianBecker,
I'm unable to replicate the issue above:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")
texts = [
    "The Life Of Pi actor will be playing a gangster in the movie Sahib Biwi Aur Gangster Returns",
    "Alice's Adventures in Wonderland (also known as Alice in Wonderland) is an 1865 English children's novel by Lewis Carroll, a mathematics don at the University of Oxford. It details the story of a girl named Alice who falls through a rabbit hole into a fantasy world of anthropomorphic creatures. It is seen as an example of the literary nonsense genre. The artist John Tenniel provided 42 wood-engraved illustrations for the book.",
]
embs = model.encode(texts)
util.cos_sim(embs, embs)
```

Output:

```
tensor([[1.0000, 0.6155],
        [0.6155, 1.0000]])
```
Please share an example text where you encounter the issue.