Latin Intertextuality Embedding Model

This model is a fine-tuned version of SPhilBerta for generating embeddings of Latin texts to detect intertextual relationships between Jerome (Hieronymus) and other classical authors. This model is intended to integrate with the LociSimiles Python package for Latin intertextuality workflows: https://pypi.org/project/locisimiles/.

Model Description

  • Task: Sentence embedding for detecting intertextual links between classical Latin authors
  • Model type: Sentence Transformer (Embedding Model)
  • Base model: bowphs/SPhilBerta
  • Max input tokens: 512
  • Language: Latin
  • License: Apache 2.0

Usage

This model generates dense vector embeddings for Latin text that can be used for semantic similarity tasks, particularly for detecting intertextual relationships. Important: This model was trained with prompts and should be used with the appropriate prompt names for optimal performance.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import numpy as np

# Load model
model = SentenceTransformer("julian-schelb/SPhilBerta-emb-lat-intertext-v1")

# Example: Jerome text and candidates (1 positive match, 2 unrelated)
queries = [
    "omnia fert aetas, animum quoque; saepe ego longos cantando puerum memini me condere soles."
]
candidates = [
    "saepe ego longos cantando puerum memini me condere soles.",  # Positive match (subset of Jerome)
    "Gallia est omnis divisa in partes tres",  # Unrelated (Caesar)
    "in nova fert animus mutatas dicere formas"  # Unrelated (Ovid)
]

# Generate embeddings using prompt names
query_embeddings = model.encode(queries, prompt_name="query")
candidate_embeddings = model.encode(candidates, prompt_name="match")

# Calculate cosine similarity matrix
cosine_similarity_matrix = cos_sim(query_embeddings, candidate_embeddings)
print("Cosine Similarity Matrix:")
print("Query vs [Positive_Match, Caesar, Ovid]")
print(cosine_similarity_matrix[0].numpy())
print(f"Highest similarity: {cosine_similarity_matrix[0].max().item():.4f} (index: {cosine_similarity_matrix[0].argmax().item()})")

# Alternative: Manual cosine similarity calculation
query_embedding = model.encode(queries[0], prompt_name="query")
candidate_embedding = model.encode(candidates[0], prompt_name="match")
cosine_sim = np.dot(query_embedding, candidate_embedding) / (np.linalg.norm(query_embedding) * np.linalg.norm(candidate_embedding))
print(f"\nDirect cosine similarity with positive match: {cosine_sim:.4f}")
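The example above scores three candidates; in practice, a Jerome query is usually screened against many candidate passages at once. The following is a minimal numpy sketch of that ranking step using mock 4-dimensional vectors (real embeddings from this model are higher-dimensional): normalize each vector once, then a single matrix-vector product yields all cosine similarities.

```python
import numpy as np

# Mock embeddings standing in for model.encode(...) output;
# real embeddings would come from the model loaded above.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=4)
candidate_embs = rng.normal(size=(5, 4))

# Normalize so plain dot products equal cosine similarities
query_emb /= np.linalg.norm(query_emb)
candidate_embs /= np.linalg.norm(candidate_embs, axis=1, keepdims=True)

# One matrix-vector product scores every candidate against the query
scores = candidate_embs @ query_emb
ranking = np.argsort(-scores)  # candidate indices, most similar first
for idx in ranking:
    print(f"candidate {idx}: cosine similarity {scores[idx]:.4f}")
```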

Prompts

This model was trained with the following prompts:

  • Query texts: Use prompt_name="query" (corresponds to "Query: " prefix)
  • Candidate texts: Use prompt_name="match" (corresponds to "Candidate: " prefix)

For best results, always use the appropriate prompt names when encoding texts for similarity comparison.
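To make the prompt mechanism concrete, the sketch below illustrates how a prompt name maps to a text prefix prepended before tokenization. The prefixes are the ones listed above; the mapping function itself is a simplified stand-in for what SentenceTransformer does internally, not the library's actual implementation.

```python
# Prompt-name -> prefix mapping from the Prompts section above
PROMPTS = {"query": "Query: ", "match": "Candidate: "}

def apply_prompt(text: str, prompt_name: str) -> str:
    """Prepend the prefix associated with prompt_name (simplified illustration)."""
    return PROMPTS[prompt_name] + text

print(apply_prompt("omnia fert aetas, animum quoque", "query"))
# -> Query: omnia fert aetas, animum quoque
```

Encoding a query with prompt_name="query" and a candidate with prompt_name="match" therefore places each text in the role the model saw during training, which is why mixing up the two prompt names degrades similarity scores.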

Citation

TBD
