Model Card for cisco-ai/SecureBERT2.0-cross-encoder

The SecureBERT 2.0 Cross-Encoder is a cybersecurity domain-specific model fine-tuned from SecureBERT 2.0.
It scores the similarity of a pair of texts, enabling text reranking, semantic search, and cybersecurity intelligence retrieval tasks.


Model Details

Model Description

  • Developed by: Cisco AI
  • Model type: Cross Encoder (Sentence Similarity)
  • Architecture: ModernBERT (fine-tuned via Sentence Transformers)
  • Max Sequence Length: 1024 tokens
  • Output Labels: 1 (similarity score)
  • Language: English
  • License: Apache-2.0
  • Finetuned from model: cisco-ai/SecureBERT2.0-base

Uses

Direct Use

  • Semantic text similarity in cybersecurity contexts
  • Text and code reranking for information retrieval (IR)
  • Threat intelligence question–answer relevance scoring
  • Cybersecurity report and log correlation

Downstream Use

The model can be integrated into:

  • Cyber threat intelligence search engines
  • SOC automation pipelines
  • Cybersecurity knowledge graph enrichment
  • Threat hunting and incident response systems

Out-of-Scope Use

  • Generic text similarity outside the cybersecurity domain
  • Tasks requiring generative reasoning or open-domain question answering

Bias, Risks, and Limitations

The model reflects the distribution of cybersecurity-related data used during fine-tuning.
Potential risks include:

  • Overrepresentation of specific malware, technologies, or threat actors
  • Bias toward technical English sources
  • Reduced performance on non-English or mixed technical/natural text

Recommendations

Users should evaluate results for domain alignment and combine the model with other retrieval models or heuristic filters when applying it to non-cybersecurity contexts; a retrieve-then-rerank sketch is shown in the Getting Started section below.


How to Get Started with the Model

Using the Sentence Transformers API

Install dependencies

pip install -U sentence-transformers

Run Inference

from sentence_transformers import CrossEncoder

# Load the model
model = CrossEncoder("cisco-ai/SecureBERT2.0-cross-encoder")

# Example pairs
pairs = [
    ["How does Stealc malware extract browser data?", 
     "Stealc uses Sqlite3 DLL to query browser databases and retrieve cookies, passwords, and history."],
    ["Best practices for post-acquisition cybersecurity integration?", 
     "Conduct security assessment, align policies, integrate security technologies, and train employees."],
]

# Compute similarity scores
scores = model.predict(pairs)
print(scores)
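
predict returns one score per input pair. Because the model has a single output label and was trained with a sigmoid activation, scores typically fall in [0, 1], with higher values indicating greater relevance.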

Rank Candidate Responses

query = "How to prevent Kerberoasting attacks?"
candidates = [
    "Implement MFA and privileged access management",
    "Monitor Kerberos tickets for anomalous activity",
    "Apply zero-trust network segmentation",
]
ranking = model.rank(query, candidates)
print(ranking)
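
In recent Sentence Transformers versions, rank returns the candidates sorted by descending score, each entry being a dict with corpus_id and score keys (pass return_documents=True to include the candidate text itself).

Retrieve and Rerank

For larger corpora, a common pattern is to retrieve candidates with a fast bi-encoder and rerank them with this cross-encoder. A minimal sketch, assuming the general-purpose sentence-transformers/all-MiniLM-L6-v2 model as the first-stage retriever (any dense retriever works; the corpus below is illustrative):

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# First stage: fast bi-encoder retrieval over the corpus
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cisco-ai/SecureBERT2.0-cross-encoder")

corpus = [
    "Stealc exfiltrates browser credentials via the Sqlite3 DLL.",
    "Kerberoasting abuses service tickets to crack service account passwords.",
    "Quarterly financial results exceeded expectations.",
]
query = "How do info-stealers harvest saved browser passwords?"

# Retrieve the top-k candidates by embedding similarity
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Second stage: rerank the retrieved candidates with the cross-encoder
candidates = [corpus[hit["corpus_id"]] for hit in hits]
scores = reranker.predict([[query, c] for c in candidates])
for cand, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")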

Framework Versions

  • python: 3.10.10
  • sentence_transformers: 5.0.0
  • transformers: 4.52.4
  • PyTorch: 2.7.0+cu128
  • accelerate: 1.9.0
  • datasets: 3.6.0

Training Details

Training Dataset

The model was fine-tuned on a cybersecurity sentence-pair similarity dataset for cross-encoder training.

  • Dataset Size: 35,705 samples
  • Columns: sentence1, sentence2, label

Average Lengths (first 1000 samples)

Field        Mean Length
sentence1    98.46
sentence2    1468.34
label        1.0

Example Schema

Field       Type     Description
sentence1   string   Query or document text
sentence2   string   Paired document or candidate response
label       float    Similarity score between the two inputs
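
For illustration, a sample in this schema might look as follows (the values are invented for demonstration):

sample = {
    "sentence1": "How to prevent Kerberoasting attacks?",
    "sentence2": "Enforce strong service account passwords and monitor for anomalous TGS requests.",
    "label": 1.0,  # positive (relevant) pair
}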

Training Objective and Loss

The model was trained using a contrastive ranking objective to learn high-quality similarity scores between cybersecurity-related text pairs.

Loss Parameters

{
    "scale": 10.0,
    "num_negatives": 10,
    "activation_fn": "torch.nn.modules.activation.Sigmoid",
    "mini_batch_size": 24
}
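
These parameters match the signature of the cross-encoder CachedMultipleNegativesRankingLoss in Sentence Transformers, which suggests training followed that recipe. A minimal fine-tuning sketch under that assumption (the in-memory dataset below is illustrative; this loss mines negatives from in-batch positives, so only positive pairs are supplied):

import torch
from datasets import Dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import CachedMultipleNegativesRankingLoss

# Illustrative (query, positive) pairs; actual training used 35,705 samples
train_dataset = Dataset.from_dict({
    "sentence1": [
        "How does Stealc malware extract browser data?",
        "How to prevent Kerberoasting attacks?",
    ],
    "sentence2": [
        "Stealc uses Sqlite3 DLL to query browser databases.",
        "Monitor Kerberos tickets for anomalous activity.",
    ],
})

model = CrossEncoder("cisco-ai/SecureBERT2.0-base", num_labels=1)

# Loss configured with the parameters reported above
loss = CachedMultipleNegativesRankingLoss(
    model=model,
    scale=10.0,
    num_negatives=10,
    activation_fn=torch.nn.Sigmoid(),
    mini_batch_size=24,
)

trainer = CrossEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()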

Evaluation

Testing Data, Factors & Metrics

Testing Data

The evaluation was performed on a held-out test set of cybersecurity-related question–answer pairs and document retrieval tasks.
Data includes:

  • Threat intelligence descriptions and related advisories
  • Exploit procedure and mitigation text pairs
  • Cybersecurity Q&A and incident analysis examples

Factors

Evaluation considered multiple aspects of similarity and relevance:

  • Domain diversity: different cybersecurity subfields (malware, vulnerabilities, network defense)
  • Task diversity: retrieval, reranking, and relevance scoring
  • Pair length: from short queries to long technical documents

Metrics

The model was evaluated using standard information retrieval metrics:

  • Mean Average Precision (mAP): measures ranking precision across all retrieved results
  • Recall@1 (R@1): measures the proportion of correct top-1 matches
  • Normalized Discounted Cumulative Gain (NDCG@10): evaluates ranking quality up to the 10th result
  • Mean Reciprocal Rank (MRR@10): assesses the average rank position of the first correct answer
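
For reference, a minimal sketch of how two of these metrics can be computed from a ranked candidate list (illustrative helpers, not the evaluation harness used here):

def recall_at_1(ranked_ids, relevant_id):
    # 1.0 if the top-ranked candidate is the relevant one, else 0.0
    return 1.0 if ranked_ids[0] == relevant_id else 0.0

def mrr_at_10(ranked_ids, relevant_id):
    # Reciprocal rank of the first correct answer within the top 10
    for rank, cid in enumerate(ranked_ids[:10], start=1):
        if cid == relevant_id:
            return 1.0 / rank
    return 0.0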

Results

Model                          mAP     R@1     NDCG@10   MRR@10
ms-marco-TinyBERT-L2           0.920   0.849   0.964     0.955
SecureBERT 2.0 Cross-Encoder   0.955   0.948   0.986     0.983

Summary

The SecureBERT 2.0 Cross-Encoder achieves state-of-the-art retrieval and ranking performance on cybersecurity text similarity tasks.

Compared to the general-purpose ms-marco-TinyBERT-L2 baseline:

  • Improves mAP by +0.035 (0.955 vs. 0.920)
  • Achieves nearly perfect R@1 and MRR@10, indicating highly accurate top-1 retrieval
  • Shows the strongest NDCG@10, reflecting excellent ranking quality across top results

These results confirm that domain-specific pretraining and fine-tuning substantially enhance semantic understanding and information retrieval capabilities in cybersecurity applications.


Citation

BibTeX

@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}

Model Card Authors

Cisco AI

Model Card Contact

For inquiries, please contact [email protected]
