IDP-ESM2-150M
IDP-ESM2-150M is an ESM2-style encoder for intrinsically disordered protein (IDP) sequence representation learning, trained on IDP-Euka-90.
This repository provides a Transformer encoder suitable for extracting per-sequence embeddings (mean-pooled over residues with padding masked out).
Quick start: generate mean-pooled embeddings
The snippet below loads the tokenizer and model, runs a forward pass on a couple of sequences, and computes mean-pooled embeddings (ignoring padding): exactly the setup typically used for downstream tasks.
from transformers import AutoTokenizer, AutoModel
import torch
# --- Config ---
model_name = "InstaDeepAI/IDP-ESM2-150M"
# --- Load model and tokenizer ---
# All ESM2 checkpoints share the same amino-acid vocabulary, so the public ESM2 tokenizer is reused here
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained(model_name)
model.eval()
# (optional) use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# --- Input sequences ---
sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]
# --- Tokenize ---
inputs = tokenizer(
    sequences,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}
# --- Forward pass ---
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_dim)
# --- Compute mean embedding per sequence (excluding padding; the ESM <cls>/<eos> special tokens are still included in the average) ---
attention_mask = inputs["attention_mask"]
sum_embeddings = torch.sum(embeddings * attention_mask.unsqueeze(-1), dim=1)
lengths = attention_mask.sum(dim=1, keepdim=True)
avg_embeddings = sum_embeddings / lengths # shape: (batch, hidden_dim)
print("Average embedding shape:", avg_embeddings.shape)
print("Example embedding:", avg_embeddings[0, :5]) # show first 5 dims for IDP-ESM2-8M
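Since these mean-pooled embeddings are the representation typically fed to downstream tasks, the minimal sketch below fits a simple classifier head on top of them. The binary labels and the scikit-learn LogisticRegression head are hypothetical placeholders for illustration; in practice you would embed a much larger labelled dataset first.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical downstream task: placeholder binary labels, one per sequence.
X = avg_embeddings.cpu().numpy()   # shape: (num_sequences, hidden_dim)
labels = np.array([0, 1])          # illustrative labels only

# Fit a lightweight classifier head on the frozen embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
print("Predicted labels:", clf.predict(X))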