Why does cosine similarity of identical embeddings not return exactly 1.0, and is there a way to ensure this?
#3 · by ChristianBecker · opened
Hi,
I’m using ibm-granite/granite-embedding-125m-english to compute sentence embeddings. When I compute the cosine similarity between an embedding vector and itself, I expect to get 1.0.
However, I consistently get a value like 0.9991035461425781.
I understand this might be due to floating-point precision limits, but:
👉 Is there a recommended way (e.g., a different normalization step or method) to ensure that the cosine similarity between an embedding and itself is exactly 1.0?
👉 Is this slight deviation expected with this model’s output, or am I missing a step (e.g., an official normalization layer)?
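As a side note on the precision question, here is a minimal NumPy sketch (using a random vector as a stand-in for a model embedding, since the behavior is independent of the model): the cosine similarity of a vector with itself differs from 1.0 only at float32 rounding scale, and pre-normalizing the vector to unit length (which `sentence-transformers` can do via `model.encode(..., normalize_embeddings=True)`) reduces cosine similarity to a plain dot product.

```python
import numpy as np

# Random float32 vector standing in for a 768-dim model embedding.
rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)

# Cosine similarity of v with itself: dot(v, v) / (||v|| * ||v||).
# In float32 this is close to 1.0 but may not be bit-exact.
cos = float(np.dot(v, v) / (np.linalg.norm(v) * np.linalg.norm(v)))
print(cos)

# Pre-normalizing to unit length: the dot product IS the cosine similarity,
# and the deviation from 1.0 stays at rounding-error scale.
v_unit = v / np.linalg.norm(v)
cos_norm = float(np.dot(v_unit, v_unit))
print(cos_norm)
```

A deviation as large as 0.9991 is well beyond float32 rounding, so if you see that value, the two vectors being compared are probably not byte-identical (e.g. encoded in separate calls with different settings) rather than a precision artifact.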
Hi @ChristianBecker,
I'm unable to replicate the issue above:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")
texts = [
    "The Life Of Pi actor will be playing a gangster in the movie Sahib Biwi Aur Gangster Returns",
    "Alice's Adventures in Wonderland (also known as Alice in Wonderland) is an 1865 English children's novel by Lewis Carroll, a mathematics don at the University of Oxford. It details the story of a girl named Alice who falls through a rabbit hole into a fantasy world of anthropomorphic creatures. It is seen as an example of the literary nonsense genre. The artist John Tenniel provided 42 wood-engraved illustrations for the book.",
]
embs = model.encode(texts)
util.cos_sim(embs, embs)
```

Output:

```
tensor([[1.0000, 0.6155],
        [0.6155, 1.0000]])
```
Please share an example text where you encounter the issue.