---
license: mit
tags:
- sentence-transformers
- chemistry
- molecular-similarity
- cheminformatics
- ssl
- smiles
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# miniChembed-prototype

This is an experimental **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:

- **ChEMBL34** (Zdrazil et al., 2023)
- **COCONUTDB** (Sorokina et al., 2021)
- **SuperNatural3** (Gallo et al., 2023)

The model maps SMILES strings to a **320-dimensional dense vector space** optimized for **molecular similarity search, clustering, and scaffold analysis, without any supervision from bioactivity, property labels, or precomputed fingerprints**.

Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules. The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.

> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, or tweaking the augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.

---

## Model Details

### Architecture & Training

| Attribute | Value |
|-----------|-------|
| **Base architecture** | Custom RoBERTa-style transformer (6 layers, 320 hidden dim, 4 attention heads, ~8M params) |
| **Initialization** | Random (not pretrained on text or chemistry) |
| **Training objective** | **Barlow Twins**: redundancy reduction via the cross-correlation matrix |
| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`) |
| **Training data** | ~24K unique molecules → augmented into positive pairs |
| **Sequence length** | 514 tokens |
| **Embedding dimension** | 320 |
| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
| **Pooling** | Mean pooling over token embeddings |
| **Similarity metric** | Cosine similarity |
| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
| **Learning rate** | 1e-4 |
| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
| **Training duration** | 5 epochs |
| **Hardware** | Single NVIDIA 930MX GPU |

### Architecture (SentenceTransformer format)

```python
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

> Note: The model was not initialized from a language model; it was trained from scratch on SMILES using only the Barlow Twins objective.
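For illustration, the augmentation listed above can be reproduced with RDKit alone: two independent random enumerations of the same molecule give a positive pair of syntactically different but chemically identical SMILES. This is a minimal sketch, not the exact code from `./train/trainbarlow.py`; the helper name `random_smiles_pair` is illustrative.

```python
# Minimal sketch of the stochastic SMILES augmentation (positive-pair generation).
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two independent random enumerations of the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return (Chem.MolToSmiles(mol, doRandom=True),
            Chem.MolToSmiles(mol, doRandom=True))

view_a, view_b = random_smiles_pair("O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3")  # cytisine
print(view_a)
print(view_b)  # different strings, same molecule
```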
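Likewise, the sketch below shows roughly what the Barlow Twins objective computes on the projection-head outputs of the two augmented views, using the off-diagonal weight λ = 5.0 from the table above. Variable names are illustrative; the actual loss implementation lives in `./train/trainbarlow.py`.

```python
# Sketch of the Barlow Twins redundancy-reduction loss (not the exact training code).
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 5.0) -> torch.Tensor:
    """z_a, z_b: (batch, dim) projection-head outputs for the two augmented views."""
    n = z_a.size(0)
    # Standardize each embedding dimension across the batch
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + 1e-6)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + 1e-6)
    # Empirical cross-correlation matrix between the two views (dim x dim)
    c = (z_a.T @ z_b) / n
    # Push the diagonal toward 1 (invariance to the augmentation) ...
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # ... and the off-diagonal toward 0 (redundancy reduction between dimensions)
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```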
---

## Usage

### Installation

```bash
pip install -U sentence-transformers rdkit-pypi
```

### Direct Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Run inference
sentences = [
    'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3',  # Cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",   # Varenicline
    "c1ncccc1[C@@H]2CCCN2C",                 # Nicotine
    'Nc1nc2cncc-2co1',                       # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000,  0.2279, -0.1979, -0.3754],
#         [ 0.2279,  1.0000,  0.7371,  0.6745],
#         [-0.1979,  0.7371,  1.0000,  0.9803],
#         [-0.3754,  0.6745,  0.9803,  1.0000]])
```

High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation, not from explicit chemical knowledge or labels.

### Testing Similarity Search

> Tip: For large-scale similarity search, integrate the embeddings with Meta's FAISS. For an example FAISS indexing pipeline, see `./examples/faiss.ipynb`.

Cytisine as query, against the 24K embedded index:

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/kZciikiDjFOCXJrCzb1Lh.png)

```
Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
```

## Comparison to Traditional Fingerprints

### Overview

| Feature | ECFP4 / MACCS | miniChembed-prototype |
|---------|---------------|-----------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |

### Clustering

Preliminary clustering evaluation vs. ECFP4 on 64 molecules spanning 4 classes:

![image](https://cdn-uploads.huggingface.co/production/uploads/667da868d653c0b02d6a2399/SNH7u0tegdzmYGFbJ9F-0.png)

```
ARI (Embeddings)        : 0.084
ARI (ECFP4)             : 0.024
Silhouette (Embeddings) : 0.398
Silhouette (ECFP4)      : 0.025
```

---

## Training Summary

- **Objective**: Minimize off-diagonal terms in the cross-correlation matrix of augmented views.
- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) – mean(cross-molecule cosine)` → higher means better separation between intra- and inter-molecular similarity (see the sketch after this section).
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating strong separation between same-molecule and cross-molecule similarity.

```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```
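The health metric above can be approximated with the released checkpoint as follows. This is a rough sketch assuming two random enumerations per molecule; `barlow_health` is an illustrative helper name, not the exact evaluator used during training.

```python
# Sketch of the Barlow Health Score: mean same-molecule cosine minus mean
# cross-molecule cosine, computed on two random enumerations of each molecule.
from rdkit import Chem
from sentence_transformers import SentenceTransformer

def barlow_health(model: SentenceTransformer, smiles_list: list[str]) -> float:
    view_a = [Chem.MolToSmiles(Chem.MolFromSmiles(s), doRandom=True) for s in smiles_list]
    view_b = [Chem.MolToSmiles(Chem.MolFromSmiles(s), doRandom=True) for s in smiles_list]
    emb_a = model.encode(view_a, convert_to_tensor=True, normalize_embeddings=True)
    emb_b = model.encode(view_b, convert_to_tensor=True, normalize_embeddings=True)
    cos = emb_a @ emb_b.T                                            # (n, n) cosine matrix
    n = cos.shape[0]
    same_mol = cos.diagonal().mean()                                 # same molecule, different SMILES
    cross_mol = (cos.sum() - cos.diagonal().sum()) / (n * (n - 1))   # different molecules
    return (same_mol - cross_mol).item()

model = SentenceTransformer("gbyuvd/miniChembed-prototype")
print(barlow_health(model, ["O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",
                            "c1ncccc1[C@@H]2CCCN2C",
                            "Nc1nc2cncc-2co1"]))
```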
---

## Limitations

- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
- Input must be **valid SMILES**; invalid strings may produce erratic embeddings.
- **Not trained on bioactivity data**, so similarity reflects structural syntax, not biological function.
- Small-scale prototype (~24K molecules); the final version will scale to 2.1M molecules if the approach proves effective.

---

## Reproducibility

This model was trained with a custom script based on Sentence Transformers v5.1.0, in the following environment:

- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

Training code, configuration, and evaluation scripts are available in this repo under `./train/trainbarlow.py` and `./train/config.yaml`.

---

## Reference

Note that the method used here does not use a target network; instead, positive pairs come from RDKit-based random enumeration of each molecule's SMILES.

```
@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
  title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
  author={Ömer Veysel Çağatan},
  year={2024},
  eprint={2401.15316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2401.15316},
}
```

---

## Citation

If you use this model, please cite:

```bibtex
SBERT:

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  url = "https://arxiv.org/abs/1908.10084"
}

Tokenizer:

@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
  title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
  author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
  year={2020},
  eprint={2010.09885},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2010.09885},
}

Data:

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}

@article{Gallo2023,
  author={Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title={{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal={Nucleic Acids Research},
  year={2023},
  month=jan,
  day={6},
  volume={51},
  number={D1},
  pages={D654-D659},
  doi={10.1093/nar/gkac1008}
}

Optimizer:

@article{wright2021ranger21,
  title={Ranger21: a synergistic deep learning optimizer},
  author={Wright, Less and Demeure, Nestor},
  year={2021},
  journal={arXiv preprint arXiv:2106.13731}
}
```