SAFE_20M

SAFE_20M is a transformer-based model for molecular generation. It was trained from scratch on the MOSES dataset after the dataset was converted from SMILES to the SAFE (Sequential Attachment-based Fragment Embedding) format, a line notation that writes each molecule as a sequence of connected fragments while remaining a valid SMILES string.

Evaluation Results

On the evaluation set, SAFE_20M achieved the following result:

  • Loss: 0.4024
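The reported loss can be read as a perplexity. The short sketch below assumes the value is a mean per-token cross-entropy in nats, which is the usual convention for language-model training but is not stated on this card.

```python
import math

# Reported evaluation loss from this card.
# ASSUMPTION: mean per-token cross-entropy, measured in nats.
eval_loss = 0.4024

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity: {perplexity:.3f}")  # about 1.495
```

Under that assumption, the model is on average about as uncertain as a choice between 1.5 equally likely tokens at each position.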

Model Description

SAFE_20M generates molecules as SAFE strings. Because the MOSES dataset was converted from SMILES to SAFE before training, the model learns molecules as sequences of fragments rather than as single atom-by-atom strings. This representation is intended to support applications such as:

  • Drug Discovery: Identifying potential drug candidates with desirable properties.
  • Materials Science: Designing new materials with specific characteristics.
  • Chemical Engineering: Innovating chemical processes and compounds.

SAFE Framework

The SAFE framework, integral to SAFE_20M, was introduced in the following paper:

@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}

Intended Uses & Limitations

Intended Uses

SAFE_20M is primarily intended for:

  • Generating Molecular Structures: Creating novel molecules with desired properties.
  • Exploring Chemical Space: Navigating the vast landscape of possible chemical compounds for research and development.
  • Assisting in Material Design: Facilitating the creation of new materials with specific functionalities.

Limitations

  • Validation Required: Outputs should be validated by domain experts before practical application.
  • Synthetic Feasibility: Generated molecules may not always be synthetically feasible.
  • Dataset Scope: The model's knowledge is limited to the chemical space represented in the MOSES dataset.

Training and Evaluation Data

The model was trained on the MOSES (MOlecular SEtS) dataset, a benchmark dataset for molecular generation. The dataset was converted from SMILES to the SAFE format to enhance molecular representation for machine learning tasks.
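To give a feel for what a SAFE string looks like, the sketch below splits one into its fragments and finds the ring-closure digits that join them. The ibuprofen string is a hand-written SAFE-style example used only for illustration; it is not output from this model and has not been checked against the reference encoder. The helper `attachment_points` is hypothetical, and it deliberately ignores two-digit `%nn` ring closures and digits inside square brackets.

```python
from collections import defaultdict


def attachment_points(safe_string):
    """Split a SAFE-style string into fragments and find the digits linking them."""
    fragments = safe_string.split(".")
    digit_to_fragments = defaultdict(set)
    for index, fragment in enumerate(fragments):
        for char in fragment:
            if char.isdigit():
                digit_to_fragments[char].add(index)
    # A digit that appears in two different fragments marks the bond joining them;
    # a digit confined to one fragment is an ordinary ring closure.
    links = {
        digit: sorted(indices)
        for digit, indices in digit_to_fragments.items()
        if len(indices) > 1
    }
    return fragments, links


# ASSUMPTION: hand-written SAFE-style string for ibuprofen, for illustration only.
ibuprofen_safe = "c12ccc3cc1.C3(C)C(=O)O.CC(C)C2"
fragments, links = attachment_points(ibuprofen_safe)
print(fragments)  # ['c12ccc3cc1', 'C3(C)C(=O)O', 'CC(C)C2']
print(links)      # {'2': [0, 2], '3': [0, 1]}
```

Here digit 1 closes the benzene ring inside the first fragment, while digits 2 and 3 attach the two side chains to it.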

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning Rate: 0.0005
  • Training Batch Size: 32
  • Evaluation Batch Size: 32
  • Seed: 42
  • Gradient Accumulation Steps: 2
  • Total Training Batch Size: 64
  • Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning Rate Scheduler: Linear with 20,000 warmup steps
  • Number of Epochs: 10
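The relationship between these settings can be sketched in a few lines of plain Python. The total batch size of 64 follows from 32 × 2 if training ran on a single device, which is an assumption. The schedule function is a generic linear warmup followed by linear decay to zero, and `TOTAL_STEPS` is a rough estimate taken from the training log (step 245,000 at epoch 9.9620), not a value stated on this card.

```python
LEARNING_RATE = 5e-4
PER_DEVICE_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 2
WARMUP_STEPS = 20_000
TOTAL_STEPS = 246_000  # ASSUMPTION: rough estimate of 10 epochs from the log

# ASSUMPTION: one device, so the effective batch is batch size x accumulation steps.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_schedule(step, base_lr=LEARNING_RATE,
                    warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))


print(effective_batch_size)  # 64
for step in (0, 10_000, 20_000, 133_000, TOTAL_STEPS):
    print(step, linear_schedule(step))
```

With these numbers the learning rate peaks at 0.0005 at step 20,000, which is a little under one epoch into training, and then decays for the remaining nine epochs.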

Training Results

Columns, in order: Training Loss, Epoch, Step, Validation Loss
1.1548 0.0407 1000 1.0531
0.8384 0.0813 2000 0.7846
0.7327 0.1220 3000 0.6928
0.6825 0.1626 4000 0.6570
0.6468 0.2033 5000 0.6206
0.6235 0.2440 6000 0.5964
0.6063 0.2846 7000 0.5838
0.5904 0.3253 8000 0.5679
0.5791 0.3660 9000 0.5593
0.5699 0.4066 10000 0.5527
0.5641 0.4473 11000 0.5441
0.5537 0.4879 12000 0.5399
0.5518 0.5286 13000 0.5355
0.5501 0.5693 14000 0.5353
0.542 0.6099 15000 0.5278
0.5422 0.6506 16000 0.5263
0.5367 0.6912 17000 0.5239
0.5366 0.7319 18000 0.5206
0.5339 0.7726 19000 0.5206
0.5349 0.8132 20000 0.5160
0.5248 0.8539 21000 0.5158
0.5221 0.8945 22000 0.5082
0.5172 0.9352 23000 0.5077
0.5122 0.9759 24000 0.5030
0.5094 1.0165 25000 0.5002
0.507 1.0572 26000 0.4983
0.508 1.0979 27000 0.4935
0.5041 1.1385 28000 0.4934
0.502 1.1792 29000 0.4920
0.5021 1.2198 30000 0.4888
0.5005 1.2605 31000 0.4882
0.4973 1.3012 32000 0.4876
0.4954 1.3418 33000 0.4859
0.4914 1.3825 34000 0.4843
0.4946 1.4231 35000 0.4837
0.4908 1.4638 36000 0.4810
0.4905 1.5045 37000 0.4806
0.4881 1.5451 38000 0.4791
0.4868 1.5858 39000 0.4780
0.4896 1.6264 40000 0.4777
0.484 1.6671 41000 0.4774
0.4855 1.7078 42000 0.4742
0.4837 1.7484 43000 0.4742
0.4874 1.7891 44000 0.4743
0.4817 1.8298 45000 0.4727
0.4811 1.8704 46000 0.4732
0.4801 1.9111 47000 0.4713
0.4808 1.9517 48000 0.4710
0.4797 1.9924 49000 0.4703
0.4765 2.0331 50000 0.4697
0.4762 2.0737 51000 0.4684
0.4776 2.1144 52000 0.4682
0.4744 2.1550 53000 0.4691
0.4756 2.1957 54000 0.4674
0.4741 2.2364 55000 0.4661
0.4746 2.2770 56000 0.4669
0.4726 2.3177 57000 0.4660
0.4716 2.3583 58000 0.4647
0.4718 2.3990 59000 0.4648
0.4711 2.4397 60000 0.4638
0.4718 2.4803 61000 0.4643
0.4699 2.5210 62000 0.4631
0.4706 2.5617 63000 0.4622
0.473 2.6023 64000 0.4623
0.4671 2.6430 65000 0.4613
0.4677 2.6836 66000 0.4621
0.4681 2.7243 67000 0.4609
0.4718 2.7650 68000 0.4600
0.4649 2.8056 69000 0.4598
0.4659 2.8463 70000 0.4596
0.4661 2.8869 71000 0.4589
0.4651 2.9276 72000 0.4586
0.4659 2.9683 73000 0.4581
0.4629 3.0089 74000 0.4580
0.4631 3.0496 75000 0.4589
0.4638 3.0902 76000 0.4574
0.4623 3.1309 77000 0.4566
0.4631 3.1716 78000 0.4565
0.4633 3.2122 79000 0.4557
0.4609 3.2529 80000 0.4549
0.4616 3.2936 81000 0.4546
0.4613 3.3342 82000 0.4557
0.4602 3.3749 83000 0.4544
0.4612 3.4155 84000 0.4550
0.4588 3.4562 85000 0.4532
0.4602 3.4969 86000 0.4531
0.459 3.5375 87000 0.4537
0.4598 3.5782 88000 0.4528
0.4606 3.6188 89000 0.4530
0.4614 3.6595 90000 0.4523
0.4575 3.7002 91000 0.4515
0.4601 3.7408 92000 0.4517
0.4578 3.7815 93000 0.4517
0.4573 3.8221 94000 0.4507
0.457 3.8628 95000 0.4508
0.4596 3.9035 96000 0.4507
0.4566 3.9441 97000 0.4498
0.4571 3.9848 98000 0.4491
0.4529 4.0255 99000 0.4504
0.4515 4.0661 100000 0.4496
0.4525 4.1068 101000 0.4492
0.4534 4.1474 102000 0.4489
0.4533 4.1881 103000 0.4484
0.4544 4.2288 104000 0.4471
0.4524 4.2694 105000 0.4473
0.4524 4.3101 106000 0.4478
0.4535 4.3507 107000 0.4462
0.4531 4.3914 108000 0.4463
0.452 4.4321 109000 0.4467
0.4535 4.4727 110000 0.4460
0.4523 4.5134 111000 0.4459
0.4512 4.5540 112000 0.4454
0.4487 4.5947 113000 0.4454
0.4503 4.6354 114000 0.4453
0.4528 4.6760 115000 0.4444
0.4482 4.7167 116000 0.4444
0.4508 4.7574 117000 0.4435
0.4517 4.7980 118000 0.4438
0.4484 4.8387 119000 0.4441
0.4509 4.8793 120000 0.4437
0.4485 4.9200 121000 0.4429
0.4507 4.9607 122000 0.4428
0.4462 5.0013 123000 0.4424
0.4469 5.0420 124000 0.4419
0.4454 5.0826 125000 0.4421
0.4478 5.1233 126000 0.4413
0.445 5.1640 127000 0.4413
0.4456 5.2046 128000 0.4404
0.4447 5.2453 129000 0.4405
0.4451 5.2859 130000 0.4405
0.4464 5.3266 131000 0.4411
0.4441 5.3673 132000 0.4392
0.4446 5.4079 133000 0.4405
0.4427 5.4486 134000 0.4391
0.4431 5.4893 135000 0.4390
0.4469 5.5299 136000 0.4391
0.4421 5.5706 137000 0.4387
0.4444 5.6112 138000 0.4378
0.4431 5.6519 139000 0.4374
0.4422 5.6926 140000 0.4369
0.4409 5.7332 141000 0.4373
0.444 5.7739 142000 0.4368
0.4423 5.8145 143000 0.4376
0.4418 5.8552 144000 0.4370
0.4409 5.8959 145000 0.4352
0.4416 5.9365 146000 0.4358
0.44 5.9772 147000 0.4357
0.437 6.0179 148000 0.4347
0.4355 6.0585 149000 0.4350
0.4371 6.0992 150000 0.4346
0.4364 6.1398 151000 0.4350
0.4365 6.1805 152000 0.4336
0.4374 6.2212 153000 0.4336
0.4354 6.2618 154000 0.4335
0.4364 6.3025 155000 0.4335
0.436 6.3431 156000 0.4327
0.4365 6.3838 157000 0.4332
0.4368 6.4245 158000 0.4320
0.4363 6.4651 159000 0.4317
0.4367 6.5058 160000 0.4320
0.436 6.5464 161000 0.4316
0.4351 6.5871 162000 0.4317
0.436 6.6278 163000 0.4310
0.4334 6.6684 164000 0.4307
0.4348 6.7091 165000 0.4301
0.4357 6.7498 166000 0.4293
0.4327 6.7904 167000 0.4295
0.4348 6.8311 168000 0.4294
0.4323 6.8717 169000 0.4284
0.4334 6.9124 170000 0.4283
0.4317 6.9531 171000 0.4279
0.433 6.9937 172000 0.4284
0.4273 7.0344 173000 0.4279
0.4272 7.0750 174000 0.4275
0.4265 7.1157 175000 0.4269
0.4287 7.1564 176000 0.4268
0.4282 7.1970 177000 0.4264
0.4267 7.2377 178000 0.4267
0.4271 7.2783 179000 0.4256
0.4282 7.3190 180000 0.4254
0.427 7.3597 181000 0.4251
0.4262 7.4003 182000 0.4249
0.4272 7.4410 183000 0.4248
0.4271 7.4817 184000 0.4243
0.4261 7.5223 185000 0.4236
0.4273 7.5630 186000 0.4237
0.4262 7.6036 187000 0.4238
0.426 7.6443 188000 0.4232
0.4243 7.6850 189000 0.4226
0.4242 7.7256 190000 0.4219
0.427 7.7663 191000 0.4215
0.4236 7.8069 192000 0.4211
0.422 7.8476 193000 0.4211
0.4224 7.8883 194000 0.4204
0.4237 7.9289 195000 0.4201
0.424 7.9696 196000 0.4200
0.4161 8.0102 197000 0.4196
0.4172 8.0509 198000 0.4193
0.4165 8.0916 199000 0.4192
0.4151 8.1322 200000 0.4189
0.417 8.1729 201000 0.4184
0.4172 8.2136 202000 0.4182
0.4181 8.2542 203000 0.4180
0.4167 8.2949 204000 0.4170
0.4184 8.3355 205000 0.4168
0.4148 8.3762 206000 0.4164
0.4171 8.4169 207000 0.4157
0.417 8.4575 208000 0.4158
0.4174 8.4982 209000 0.4153
0.4159 8.5388 210000 0.4149
0.4141 8.5795 211000 0.4149
0.4141 8.6202 212000 0.4144
0.4121 8.6608 213000 0.4139
0.4134 8.7015 214000 0.4133
0.4126 8.7421 215000 0.4135
0.4141 8.7828 216000 0.4125
0.4126 8.8235 217000 0.4125
0.4117 8.8641 218000 0.4119
0.4114 8.9048 219000 0.4115
0.4102 8.9455 220000 0.4113
0.4123 8.9861 221000 0.4103
0.4045 9.0268 222000 0.4104
0.4039 9.0674 223000 0.4104
0.4042 9.1081 224000 0.4100
0.4063 9.1488 225000 0.4092
0.4045 9.1894 226000 0.4091
0.4052 9.2301 227000 0.4086
0.4041 9.2707 228000 0.4082
0.4042 9.3114 229000 0.4077
0.403 9.3521 230000 0.4077
0.4047 9.3927 231000 0.4070
0.4014 9.4334 232000 0.4067
0.4032 9.4740 233000 0.4062
0.4018 9.5147 234000 0.4059
0.4015 9.5554 235000 0.4054
0.4005 9.5960 236000 0.4050
0.4016 9.6367 237000 0.4049
0.4012 9.6774 238000 0.4043
0.4014 9.7180 239000 0.4040
0.3995 9.7587 240000 0.4037
0.398 9.7993 241000 0.4035
0.3979 9.8400 242000 0.4032
0.3965 9.8807 243000 0.4029
0.3983 9.9213 244000 0.4026
0.3997 9.9620 245000 0.4025
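A few quantities can be derived from this log. The sketch below parses a handful of rows copied verbatim from the table and estimates the number of optimizer steps per epoch, the approximate number of training examples per epoch, and the final gap between validation and training loss. It assumes the Epoch column is the fractional number of epochs completed and that each step consumes the total batch size of 64; the resulting figures are estimates, not reported values.

```python
# Rows copied from the training log: training loss, epoch, step, validation loss.
LOG_EXCERPT = """\
1.1548 0.0407 1000 1.0531
0.5122 0.9759 24000 0.5030
0.4797 1.9924 49000 0.4703
0.4571 3.9848 98000 0.4491
0.44 5.9772 147000 0.4357
0.424 7.9696 196000 0.4200
0.3997 9.9620 245000 0.4025
"""

TOTAL_BATCH_SIZE = 64  # from the hyperparameter list

rows = []
for line in LOG_EXCERPT.splitlines():
    train_loss, epoch, step, val_loss = line.split()
    rows.append((float(train_loss), float(epoch), int(step), float(val_loss)))

last_train, last_epoch, last_step, last_val = rows[-1]

# ASSUMPTION: Epoch is the fractional number of epochs completed at that step.
steps_per_epoch = last_step / last_epoch
examples_per_epoch = steps_per_epoch * TOTAL_BATCH_SIZE
final_gap = last_val - last_train

print(f"steps per epoch: about {steps_per_epoch:,.0f}")
print(f"examples per epoch: about {examples_per_epoch:,.0f}")
print(f"validation minus training loss at the last row: {final_gap:.4f}")
```

The estimate comes to roughly 24,600 steps and about 1.57 million examples per epoch. The final gap of 0.0028 between validation and training loss is small, so these rows show little sign of overfitting.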

Framework Versions

  • Transformers: 4.43.3
  • PyTorch: 2.4.0+cu121
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Acknowledgements

We acknowledge and thank the authors of the SAFE framework for their valuable contribution to the field of molecular design.

References

@inproceedings{
  lombard2024molecular,
  title={Molecular Generation with State Space Sequence Models},
  author={Anri Lombard and Shane Acton and Ulrich Armel Mbou Sob and Jan Buys},
  booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
  year={2024},
  url={https://openreview.net/forum?id=1ib5oyTQIb}
}
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}

Model Details

  • Repository: anrilombard/safe-20m
  • Model Size: 20.4M parameters
  • Weights Format: Safetensors
  • Tensor Type: F32