NanoBodyBERT
NanoBodyBERT is a BERT-based model specifically pre-trained on nanobody sequences for antibody design and analysis tasks.
Model Description
This model was pre-trained with Masked Language Modeling (MLM) on nanobody sequences, with a particular focus on masking strategies that target the CDRs (Complementarity-Determining Regions).
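The exact masking schedule used during pre-training is not published here; the snippet below is only a minimal sketch of what CDR-focused masking can look like, assuming the CDR boundaries of each sequence are already known (the mask_cdr helper and the span indices are illustrative, not part of this repository).
import random

def mask_cdr(sequence, cdr_span, mask_token="[MASK]", mask_prob=1.0):
    """Mask residues inside a known CDR span for MLM-style training.

    sequence  : one-letter amino-acid string
    cdr_span  : (start, end) indices of the CDR within the sequence
    mask_prob : fraction of CDR positions to replace (1.0 masks the whole span)
    """
    start, end = cdr_span
    residues = list(sequence)
    for i in range(start, end):
        if random.random() < mask_prob:
            residues[i] = mask_token
    return "".join(residues)

# Illustrative span covering four CDR3 residues of the example sequence used below
example = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"
print(mask_cdr(example, cdr_span=(98, 102)))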
Intended Use
This model is designed for:
- Nanobody sequence analysis
- CDR region reconstruction
- Sequence embedding generation
- Antibody design applications
How to Use
Installation
First, install the required dependencies:
pip install transformers torch
Loading the Model
import torch
from transformers import BertForMaskedLM

# The custom AATokenizer is defined in tokenizer.py; make sure that file is
# importable from your project (e.g. place it next to your script)
from tokenizer import AATokenizer

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained("LLMasterLL/nanobodybert")
tokenizer = AATokenizer.from_pretrained("LLMasterLL/nanobodybert")

# Move the model to GPU if available and switch to evaluation mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
Inference Example
# Example nanobody sequence
sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAKDYWGQGTQVTVSS"

# Encode the sequence
input_ids = tokenizer.encode(sequence, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Get per-residue embeddings from the BERT encoder
with torch.no_grad():
    outputs = model.bert(input_ids=input_ids, return_dict=True)
    embeddings = outputs.last_hidden_state

cls_embedding = embeddings[0, 0, :]  # [CLS] token embedding
print(f"CLS embedding shape: {cls_embedding.shape}")
Masked Prediction Example
# Create a masked sequence (here part of the CDR3 region is masked, for example)
masked_sequence = "QVQLVESGGGLVQPGGSLRLSCAASGFTFDDYSIAWFRQAPGKEREGVAAISWGGGSTYYADSVKGRFTISRDNAKNTLYLQMNSLRAEDTAVYYCAK[MASK][MASK][MASK][MASK]QGTQVTVSS"

# Tokenize
tokens = masked_sequence.replace("[MASK]", tokenizer.mask_token)
input_ids = tokenizer.encode(tokens, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long).to(device)

# Predict the most likely token at every position
with torch.no_grad():
    outputs = model(input_ids=input_ids)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode
predicted_sequence = tokenizer.decode(predictions[0].cpu().tolist(), skip_special_tokens=True)
print(f"Predicted: {predicted_sequence}")
Model Architecture
- Architecture: BERT (Bidirectional Encoder Representations from Transformers)
- Vocabulary: 26 tokens (20 amino acids + special tokens)
- Max sequence length: 256
- Special tokens:
  - [PAD]: padding token
  - [CLS]: classification token (sequence start)
  - [SEP]: separator token (sequence end)
  - [MASK]: mask token for MLM
  - [UNK]: unknown token
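As a quick, illustrative sanity check of these conventions (a usage sketch based on the encode/decode calls shown earlier, not an official test), a short sequence should round-trip through the tokenizer and gain the two boundary tokens:
ids = tokenizer.encode("QVQLVESGGG", add_special_tokens=True)
print(len(ids))  # expected 12: [CLS] + 10 residues + [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))  # expected to recover "QVQLVESGGG"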
Training Data
The model was pre-trained on a curated dataset of nanobody sequences with strategic CDR masking.
Citation
If you use this model in your research, please cite:
@misc{nanobodybert,
  title={NanoBodyBERT: BERT-based Pre-trained Model for Nanobody Sequences},
  author={Ling Luo},
  year={2025},
  howpublished={\url{https://huggingface.co/LLMasterLL/nanobodybert}},
}
License
Apache 2.0
Contact
For questions and feedback, please open an issue on the repository.