Why does cosine similarity of identical embeddings not return exactly 1.0, and is there a way to ensure this?

#3
by ChristianBecker - opened

Hi,
I’m using ibm-granite/granite-embedding-125m-english to compute sentence embeddings. When I compute the cosine similarity between an embedding vector and itself, I expect to get 1.0.

However, I consistently get a value like 0.9991035461425781.

I understand this might be due to floating-point precision limits, but:

👉 Is there a recommended way (e.g., a different normalization step or method) to ensure that the cosine similarity between an embedding and itself is exactly 1.0?

👉 Is this slight deviation expected with this model’s output, or am I missing a step (e.g., an official normalization layer)?
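For context, a common way to sanity-check this is to separate the model from the arithmetic. The sketch below (my own, not from this thread) uses a random float32 vector as a stand-in for an embedding and shows that an explicit L2 normalization keeps the self dot product at 1.0 up to ordinary float32 rounding (~1e-7), which is far tighter than the 0.999 value reported above. In `sentence-transformers`, `model.encode(texts, normalize_embeddings=True)` performs this normalization for you.

```python
import numpy as np

# Stand-in for a sentence embedding; the actual model output is not needed
# to demonstrate the floating-point behaviour.
rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)

def cos_sim(a, b):
    # Plain cosine similarity: dot product over the product of norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Self-similarity without any extra step: may differ from 1.0 only in the
# last few bits of float32 precision.
raw = cos_sim(v, v)

# Explicit L2 normalization first; the self dot product is then 1.0 up to
# float32 rounding.
u = v / np.linalg.norm(v)
normalized = float(np.dot(u, u))

print(raw, normalized)
```

If you see a deviation as large as 0.999, that is well beyond float32 rounding, so it usually points to something else in the pipeline (e.g. comparing embeddings from two separate `encode` calls, lower-precision inference, or a non-deterministic setting) rather than the cosine computation itself.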

IBM Granite org

Hi @ChristianBecker
I am unable to replicate the issue above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")
texts = [
    "The Life Of Pi actor will be playing a gangster in the movie Sahib Biwi Aur Gangster Returns",
    "Alice's Adventures in Wonderland (also known as Alice in Wonderland) is an 1865 English children's novel by Lewis Carroll, a mathematics don at the University of Oxford. It details the story of a girl named Alice who falls through a rabbit hole into a fantasy world of anthropomorphic creatures. It is seen as an example of the literary nonsense genre. The artist John Tenniel provided 42 wood-engraved illustrations for the book.",
]

embs = model.encode(texts)
util.cos_sim(embs, embs)
```

```
tensor([[1.0000, 0.6155],
        [0.6155, 1.0000]])
```

Please share an example text where you encounter the issue.
